Tuesday, April 26, 2016

Apache Storm: What is Apache Storm?

Apache Storm is a free and open source, distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Apache Hadoop 2.x. Its effective stream processing capabilities are trusted by Twitter and Yahoo for quickly extracting insights from their Big Data.




WHAT IS APACHE STORM?

Apache Storm is an open source engine that processes data in real time using a distributed architecture. Storm is simple and flexible, and it can be used with any programming language of your choice.
Let’s look at the various components of a Storm Cluster:
  1. Nimbus node. The master node (similar to the Hadoop JobTracker)
  2. Supervisor nodes. Start and stop workers and communicate with Nimbus through ZooKeeper
  3. ZooKeeper nodes. Coordinate the Storm cluster
Storm architecture: Nimbus, ZooKeeper, Supervisor
Here are a few terms and concepts you should be familiar with before we go hands-on:
  • Tuples. An ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7)
  • Streams. An unbounded sequence of tuples.
  • Spouts. Sources of streams in a computation (e.g. a Twitter API)
  • Bolts. Process input streams and produce output streams. They can:
    • Run functions;
    • Filter, aggregate, or join data;
    • Talk to databases.
  • Topologies. The overall calculation, represented visually as a network of spouts and bolts
Basic concepts map: a topology processes data as it streams in from the spout; the bolts process it, and the results are passed into Hadoop.
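The concepts above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Storm's API: the function names (`sensor_spout`, `threshold_bolt`, `label_bolt`, `run_topology`) and the sample readings are hypothetical, generators stand in for streams, and a finite list replaces the unbounded source so that the example terminates.

```python
from typing import Iterable, Iterator, List, Tuple

# A Storm-style tuple is modelled as a plain Python tuple:
# (hypothetical sensor id, hypothetical reading).
Reading = Tuple[str, int]
Labelled = Tuple[str, int, str]


def sensor_spout(readings: Iterable[Reading]) -> Iterator[Reading]:
    """Spout: the source of a stream. A real stream is unbounded;
    this one replays a finite list so the sketch terminates."""
    for reading in readings:
        yield reading


def threshold_bolt(stream: Iterable[Reading], minimum: int) -> Iterator[Reading]:
    """Bolt that filters: only tuples at or above `minimum` pass through."""
    for sensor_id, value in stream:
        if value >= minimum:
            yield (sensor_id, value)


def label_bolt(stream: Iterable[Reading]) -> Iterator[Labelled]:
    """Bolt that runs a function on each tuple and emits a wider tuple."""
    for sensor_id, value in stream:
        yield (sensor_id, value, "high" if value >= 90 else "normal")


def run_topology(readings: Iterable[Reading], minimum: int = 50) -> List[Labelled]:
    """Topology: wire spout -> filter bolt -> labelling bolt."""
    return list(label_bolt(threshold_bolt(sensor_spout(readings), minimum)))


if __name__ == "__main__":
    for output in run_topology([("a", 10), ("b", 75), ("c", 95)]):
        print(output)
```

Each stage only sees the tuples emitted by the stage before it, which is the same wiring idea a Storm topology expresses across many machines.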

INSTALLATION AND SETUP VERIFICATION:

STEP 1: CHECK STORM SERVICE IS RUNNING

Let’s check whether the sandbox has Storm processes up and running by logging in to Ambari and looking for Storm in the list of services.

STEP 2: DOWNLOAD THE STORM TOPOLOGY JAR FILE

Now let’s look at a streaming use case built from Storm’s spouts and bolts. The use case is simple, but it should give you a realistic feel for running and operating on streaming data in Hadoop with this topology.
Let’s get the jar file, which is available in the Storm Starter kit. It contains other examples as well, but let’s use the WordCount topology and see how to start it. We will also track it in the Storm UI.
wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/Storm/0.9.0.1/storm-starter-0.0.1-storm-0.9.0.1.jar

STEP 3: CHECK CLASSES AVAILABLE IN JAR

In the Storm example Topology, we will be using three main parts or processes:
  1. Sentence Generator Spout
  2. Sentence Split Bolt
  3. WordCount Bolt
You can check the classes available in the jar as follows:
jar -tf storm-starter-0.0.1-storm-0.9.0.1.jar | grep Sentence
jar -tf storm-starter-0.0.1-storm-0.9.0.1.jar | grep Split
jar -tf storm-starter-0.0.1-storm-0.9.0.1.jar | grep WordCount
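A jar file is a zip archive, so the same check can be scripted. The sketch below uses Python's standard `zipfile` module; the helper name `find_classes` is hypothetical, and the jar is assumed to sit in the current directory under the name used in the commands above. It reads only the archive index and extracts nothing.

```python
import os
import zipfile


def find_classes(jar_path, keyword):
    """Return the .class entries in the jar whose path contains `keyword`.

    Only the archive index is read; no file is extracted to disk.
    """
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name
            for name in jar.namelist()
            if name.endswith(".class") and keyword in name
        )


if __name__ == "__main__":
    # Assumed location: the jar downloaded in Step 2, in the current directory.
    jar_file = "storm-starter-0.0.1-storm-0.9.0.1.jar"
    if os.path.exists(jar_file):
        for keyword in ("Sentence", "Split", "WordCount"):
            for entry in find_classes(jar_file, keyword):
                print(entry)
    else:
        print("jar not found:", jar_file)
```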

STEP 4: RUN WORD COUNT TOPOLOGY

Let’s run the Storm job. Its spout generates random sentences, a split bolt breaks each sentence into words, and the WordCount bolt counts the occurrences of each word.
Let’s run the Storm jar file.
[root@sandbox ~]# storm jar storm-starter-0.0.1-storm-0.9.0.1.jar storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com
Note: For Sandbox versions without Storm preinstalled, navigate to the /usr/lib/storm/bin/ directory to run the command above.
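To make the data flow of this job concrete, here is a rough, single-process sketch of the three stages in plain Python. It is not the code inside the starter jar: the sample sentences and the function names `sentence_spout`, `split_bolt`, and `count_bolt` are assumptions made for illustration, and the spout is capped at a fixed number of sentences so that the script ends.

```python
import random
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Hypothetical sample sentences; the starter kit ships its own.
SENTENCES = [
    "the cow jumped over the moon",
    "an apple a day keeps the doctor away",
]


def sentence_spout(limit: int, seed: int = 0) -> Iterator[str]:
    """Spout: emit `limit` randomly chosen sentences."""
    rng = random.Random(seed)
    for _ in range(limit):
        yield rng.choice(SENTENCES)


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Split bolt: emit one word per whitespace-separated token."""
    for sentence in sentences:
        for word in sentence.split():
            yield word


def count_bolt(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Count bolt: emit (word, running count) every time a word arrives."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield (word, counts[word])


if __name__ == "__main__":
    for word, count in count_bolt(split_bolt(sentence_spout(limit=3))):
        print(word, count)
```

Because the count bolt keeps a running total, a word that has already been seen is emitted again with a higher count, which is the kind of steadily growing figure you would expect to find when inspecting the count bolt's output.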

STEP 5: OPEN STORM UI

Let’s open the Storm UI and look at the topology graphically.
You should see the WordCount topology listed in the Topology summary.

STEP 6: CLICK ON WORDCOUNT TOPOLOGY

The topology is listed under Topology Summary. Click on WordCount to open its detail page.
Click on count.
Click on any port and you will be able to view the results.
You just processed streaming data using Apache Storm. Congratulations on completing the Tutorial!

APPENDIX A: VIEW STORM LOG FILES

Last but not least, you can always look at the log files. These logs are extremely useful for debugging and for checking status. They are located in the following directory:
[root@sandbox ~]# cd /var/log/storm

[root@sandbox storm]# ls -ltr
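If you would rather inspect the log directory from a script, the listing can be approximated in Python. This is a small sketch, assuming the `/var/log/storm` path shown above; the helper name `logs_oldest_first` is hypothetical. It returns regular files ordered by modification time, oldest first, which mirrors the ordering of `ls -ltr`.

```python
import os


def logs_oldest_first(directory="/var/log/storm"):
    """Return the names of regular files in `directory`,
    ordered by modification time with the oldest first."""
    with os.scandir(directory) as entries:
        files = [entry for entry in entries if entry.is_file()]
    files.sort(key=lambda entry: entry.stat().st_mtime)
    return [entry.name for entry in files]


if __name__ == "__main__":
    log_dir = "/var/log/storm"
    if os.path.isdir(log_dir):
        for name in logs_oldest_first(log_dir):
            print(name)
    else:
        print("log directory not found:", log_dir)
```

The most recently written file appears last, which is usually the first place to look after submitting a topology.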

APPENDIX B: INSTALL MAVEN AND GET STARTED WITH STORM STARTER KIT

INSTALL MAVEN

Download and install Apache Maven using the commands below:
curl -o /etc/yum.repos.d/epel-apache-maven.repo https://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo
yum -y install apache-maven
mvn -version

GET STARTED WITH STORM STARTER KIT

Download the Storm Starter Kit and try other topology examples, such as ExclamationTopology and ReachTopology.
git clone https://github.com/apache/storm.git && cd storm/examples/storm-starter
