Tech blog: Apache Storm: What is Apache Storm?

Tuesday, April 26, 2016

Apache Storm is a free and open source, distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Apache Hadoop 2.x. Its effective stream processing capabilities are trusted by Twitter and Yahoo for quickly extracting insights from their Big Data.

WHAT IS APACHE STORM?

Apache Storm is an open source engine that processes data in real time using a distributed architecture. Storm is simple and flexible, and it can be used with any programming language of your choice.

Let’s look at the various components of a Storm Cluster:

Nimbus node. The master node (similar to the Hadoop JobTracker)
Supervisor nodes. Start and stop workers and communicate with Nimbus through ZooKeeper
ZooKeeper nodes. Coordinate the Storm cluster

Architecture: Nimbus, ZooKeeper, Supervisor

Here are a few terms and concepts you should be familiar with before we go hands-on:

Tuples. An ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7)
Streams. An unbounded sequence of tuples.
Spouts. Sources of streams in a computation (e.g. a Twitter API)
Bolts. Process input streams and produce output streams. They can:
- Run functions;
- Filter, aggregate, or join data;
- Talk to databases.
Topologies. The overall calculation, represented visually as a network of spouts and bolts

Basic Concepts Map: A topology processes data as it streams in from the spout; the bolts process it, and the results are passed into Hadoop.
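The concepts above can be sketched in plain Python. This is a conceptual model only and does not use Storm's API: the names number_spout, even_filter_bolt, running_sum_bolt, and build_topology are hypothetical, and Python generators stand in for streams.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple

# Conceptual model only: none of these functions are part of Storm.


def number_spout() -> Iterator[Tuple[int]]:
    """Spout: the source of a stream. Emits the 1-tuples (0,), (1,), (2,), ... without end."""
    n = 0
    while True:
        yield (n,)
        n += 1


def even_filter_bolt(stream: Iterable[Tuple[int]]) -> Iterator[Tuple[int]]:
    """Bolt that filters: passes on only the tuples whose value is even."""
    for (value,) in stream:
        if value % 2 == 0:
            yield (value,)


def running_sum_bolt(stream: Iterable[Tuple[int]]) -> Iterator[Tuple[int, int]]:
    """Bolt that aggregates: emits 2-tuples of (value, running total so far)."""
    total = 0
    for (value,) in stream:
        total += value
        yield (value, total)


def build_topology() -> Iterator[Tuple[int, int]]:
    """Topology: the network that wires the spout into the chain of bolts."""
    return running_sum_bolt(even_filter_bolt(number_spout()))


if __name__ == "__main__":
    # The stream is unbounded, so take only the first five output tuples.
    for tup in islice(build_topology(), 5):
        print(tup)
```

Because the stream never ends, the caller decides how much of the output to consume. In a real Storm cluster the spout and bolts would run in parallel on worker processes rather than in a single Python process.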

INSTALLATION AND SETUP VERIFICATION:

STEP 1: CHECK STORM SERVICE IS RUNNING

Let’s check whether the sandbox has Storm processes up and running by logging in to Ambari and looking for Storm in the list of services.

STEP 2: DOWNLOAD THE STORM TOPOLOGY JAR FILE

Now let’s look at a streaming use case built from Storm’s spouts and bolts. The use case is simple, but it should give you a realistic feel for running a topology that operates on streaming data in Hadoop.

Let’s get the jar file, which is available in the Storm Starter kit. It contains other examples as well, but we will use the WordCount topology and see how to run it. We will also track it in Storm UI.

wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/Storm/0.9.0.1/storm-starter-0.0.1-storm-0.9.0.1.jar

STEP 3: CHECK CLASSES AVAILABLE IN JAR

In the Storm example Topology, we will be using three main parts or processes:

Sentence Generator Spout
Sentence Split Bolt
WordCount Bolt

You can list the classes in the jar, without extracting it, as follows:

jar -tf storm-starter-0.0.1-storm-0.9.0.1.jar | grep Sentence
jar -tf storm-starter-0.0.1-storm-0.9.0.1.jar | grep Split
jar -tf storm-starter-0.0.1-storm-0.9.0.1.jar | grep WordCount
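If the jar tool is not on your path, the same check can be sketched in Python, since a jar file is a zip archive. The helper name find_classes is hypothetical; it uses only the standard zipfile module.

```python
import zipfile
from typing import List


def find_classes(jar_path: str, keyword: str) -> List[str]:
    """Return the .class entries in the jar whose path contains keyword, sorted."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name
            for name in jar.namelist()
            if name.endswith(".class") and keyword in name
        )
```

For example, find_classes("storm-starter-0.0.1-storm-0.9.0.1.jar", "WordCount") would return the matching entries from the jar downloaded in Step 2, assuming it is in the current directory.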

STEP 4: RUN WORD COUNT TOPOLOGY

The Storm job has a spout that generates random sentences, a split bolt that breaks each sentence into words, and a WordCount bolt that counts the occurrences of each word.
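To make the data flow concrete, here is a minimal Python simulation of the three stages. It is not the code inside the starter jar and does not use Storm's API: the function names, the sample sentences, and the limit parameter (which caps the otherwise endless stream) are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# Illustrative sample data; the sentences used by the real topology may differ.
SENTENCES = [
    "the cow jumped over the moon",
    "an apple a day keeps the doctor away",
]


def sentence_spout(sentences: List[str], limit: int, seed: int = 0) -> Iterator[Tuple[str]]:
    """Spout: emits `limit` randomly chosen sentences as 1-tuples."""
    rng = random.Random(seed)
    for _ in range(limit):
        yield (rng.choice(sentences),)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: splits each sentence on whitespace and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps a running count per word and emits (word, count) on every update."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_word_count(sentences: List[str], limit: int, seed: int = 0) -> Dict[str, int]:
    """Wire spout -> split bolt -> count bolt and return the latest count per word."""
    latest: Dict[str, int] = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences, limit, seed))):
        latest[word] = count
    return latest


if __name__ == "__main__":
    for word, count in sorted(run_word_count(SENTENCES, limit=10).items()):
        print(word, count)
```

The count bolt emits an updated tuple for every word it sees, which is why the counts keep growing for as long as the stream keeps flowing.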

Let’s run the Storm jar file:

[root@sandbox ~]# storm jar storm-starter-0.0.1-storm-0.9.0.1.jar storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com

Note: For Sandbox versions without Storm preinstalled, navigate to the /usr/lib/storm/bin/ directory to run the command above.

STEP 5: OPEN STORM UI

Let’s use Storm UI to look at the topology graphically.

You should see the WordCount topology listed in the Topology summary.

STEP 6: CLICK ON WORDCOUNT TOPOLOGY

The topology is listed under Topology summary; click it to open its details.

STEP 7: NAVIGATE TO BOLT SECTION

Click on count.

STEP 8: NAVIGATE TO EXECUTOR SECTION

Click on any port and you will be able to view the results.

You just processed streaming data using Apache Storm. Congratulations on completing the tutorial!

APPENDIX A: VIEW STORM LOG FILES

Last but not least, you can always look at the log files. These logs are extremely useful for debugging and for checking status. They are located in the following directory:

[root@sandbox ~]# cd /var/log/storm

[root@sandbox storm]# ls -ltr
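As a rough sketch, the same inspection can be scripted in Python. The helper names are hypothetical, and the "ERROR" marker is an assumption about how problems are tagged in the log lines; adjust it to match what you actually see in your logs.

```python
import os
from typing import List


def logs_oldest_first(log_dir: str) -> List[str]:
    """Like `ls -tr`: file names in log_dir ordered by modification time, oldest first."""
    entries = [entry for entry in os.scandir(log_dir) if entry.is_file()]
    entries.sort(key=lambda entry: entry.stat().st_mtime)
    return [entry.name for entry in entries]


def error_lines(path: str, marker: str = "ERROR") -> List[str]:
    """Return the lines of a log file that contain marker (assumed to flag errors)."""
    with open(path, errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if marker in line]
```

Calling logs_oldest_first("/var/log/storm") puts the most recently written file last, which is usually the first one to check after a failed run.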

APPENDIX B: INSTALL MAVEN AND GET STARTED WITH STORM STARTER KIT

INSTALL MAVEN

Download and install Apache Maven using the commands below:

curl -o /etc/yum.repos.d/epel-apache-maven.repo https://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo
yum -y install apache-maven
mvn -version

GET STARTED WITH STORM STARTER KIT

Download the Storm Starter Kit and try other topology examples, such as ExclamationTopology and ReachTopology.

git clone https://github.com/apache/storm.git && cd storm/examples/storm-starter
