Apache Storm is a free and open source, distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Apache Hadoop 2.x. Its effective stream processing capabilities are trusted by Twitter and Yahoo for quickly extracting insights from their Big Data.
- What is Apache Storm?
- Step 1: Check Storm Service is Running
- Step 2: Download the Storm Topology JAR file
- Step 3: Check Classes Available in jar
- Step 4: Run Word Count Topology
- Step 5: Open Storm UI
- Step 6: Click on WordCount Topology
- Step 7: Navigate to Bolt Section
- Step 8: Navigate to Executor Section
- Appendix A: View Storm Log Files
- Appendix B: Install Maven and Get Started with Storm Starter Kit
- Further Reading
WHAT IS APACHE STORM?
Apache Storm is an open source engine that processes data in real time using a distributed architecture. Storm is simple and flexible, and it can be used with any programming language of your choice.
Let’s look at the various components of a Storm Cluster:
- Nimbus node. The master node (similar to the Hadoop JobTracker)
- Supervisor nodes. Start and stop workers and communicate with Nimbus through ZooKeeper
- ZooKeeper nodes. Coordinate the Storm cluster
Architecture: Nimbus, ZooKeeper, Supervisor
Here are a few terms and concepts you should get familiar with before we go hands-on:
- Tuples. An ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7)
- Streams. An unbounded sequence of tuples.
- Spouts. Sources of streams in a computation (e.g. a Twitter API)
- Bolts. Process input streams and produce output streams. They can:
  - Run functions;
  - Filter, aggregate, or join data;
  - Talk to databases.
- Topologies. The overall calculation, represented visually as a network of spouts and bolts
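To make these terms concrete, here is a minimal sketch in plain Java that does not use the Storm API at all. The names `MiniTopology`, `numberSpout`, `KEEP_EVEN`, and `SQUARE` are hypothetical and exist only for illustration. It assumes Java 9 or later (for `List.of`), and the spout is finite so that the demo terminates, whereas a real stream is unbounded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class MiniTopology {
    // A tuple is an ordered list of values, e.g. (7, 1, 3, 7).
    // A bolt maps one input tuple to zero or more output tuples.
    interface Bolt extends Function<List<Object>, List<List<Object>>> {}

    // Hypothetical spout: emits the 1-tuples (1) through (6).
    static List<List<Object>> numberSpout() {
        List<List<Object>> stream = new ArrayList<>();
        for (int i = 1; i <= 6; i++) {
            List<Object> tuple = List.of(i);
            stream.add(tuple);
        }
        return stream;
    }

    // Filter bolt: passes even numbers through and drops odd ones.
    static final Bolt KEEP_EVEN = t -> {
        List<List<Object>> out = new ArrayList<>();
        if ((Integer) t.get(0) % 2 == 0) {
            out.add(t);
        }
        return out;
    };

    // Function bolt: turns (n) into the 2-tuple (n, n * n).
    static final Bolt SQUARE = t -> {
        int n = (Integer) t.get(0);
        List<Object> tuple = List.of(n, n * n);
        List<List<Object>> out = new ArrayList<>();
        out.add(tuple);
        return out;
    };

    // Push every tuple of a stream through one bolt.
    static List<List<Object>> through(List<List<Object>> in, Bolt bolt) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> t : in) {
            out.addAll(bolt.apply(t));
        }
        return out;
    }

    // The "topology": spout -> filter bolt -> function bolt.
    static List<List<Object>> run() {
        return through(through(numberSpout(), KEEP_EVEN), SQUARE);
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [[2, 4], [4, 16], [6, 36]]
    }
}
```

In a real cluster each spout and bolt would run as many parallel tasks on different machines; this sketch only shows how tuples flow from one stage to the next.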
Basic Concepts Map: A topology processes data as it streams in from a spout. The bolts process it, and the results are passed into Hadoop.
INSTALLATION AND SETUP VERIFICATION:
STEP 1: CHECK STORM SERVICE IS RUNNING
Let’s check whether the sandbox has the Storm processes up and running by logging in to Ambari and looking for Storm in the list of services:
STEP 2: DOWNLOAD THE STORM TOPOLOGY JAR FILE
Now let’s look at a streaming use case built from Storm’s spouts and bolts. The use case is simple, but it should give you a realistic feel for running and operating on streaming data in Hadoop with this topology.
Let’s get the jar file, which is available in the Storm Starter kit. It contains other examples as well, but let’s use the WordCount topology and see how to run it. We will also track it in the Storm UI.
wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/Storm/0.9.0.1/storm-starter-0.0.1-storm-0.9.0.1.jar
STEP 3: CHECK CLASSES AVAILABLE IN JAR
In the Storm example Topology, we will be using three main parts or processes:
- Sentence Generator Spout
- Sentence Split Bolt
- WordCount Bolt
You can check the classes available in the jar as follows:
jar -tf storm-starter-0.0.1-storm-0.9.0.1.jar | grep Sentence
jar -tf storm-starter-0.0.1-storm-0.9.0.1.jar | grep Split
jar -tf storm-starter-0.0.1-storm-0.9.0.1.jar | grep WordCount
STEP 4: RUN WORD COUNT TOPOLOGY
Let’s run the Storm job. Its spout generates random sentences, the split bolt breaks each sentence into words, and the WordCount bolt counts how often each word occurs.
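Before running it, the word count flow can be sketched in plain Java. This is not the code inside the starter jar: the class `WordCountSketch`, its method names, and the two sample sentences are assumptions made for illustration, and the sketch counts a fixed batch, whereas the real topology keeps updating counts as sentences arrive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // Stand-in for the sentence spout: two fixed sentences
    // instead of an endless random stream.
    static List<String> sentenceSpout() {
        return Arrays.asList(
            "the cow jumped over the moon",
            "the man went to the store");
    }

    // Stand-in for the split bolt: one sentence in, one word out per token.
    static List<String> splitBolt(String sentence) {
        return Arrays.asList(sentence.split("\\s+"));
    }

    // Stand-in for the count bolt: keeps a running total per word.
    static Map<String, Integer> countBolt(List<String> words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Wire the three stages together: spout -> split -> count.
    static Map<String, Integer> run() {
        List<String> words = new ArrayList<>();
        for (String sentence : sentenceSpout()) {
            words.addAll(splitBolt(sentence));
        }
        return countBolt(words);
    }

    public static void main(String[] args) {
        // Prints one "word<TAB>count" line per distinct word.
        run().forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

With these two sample sentences, "the" is counted four times and every other word once, which is the kind of per-word result the count bolt exposes in the Storm UI.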
Let’s run the Storm Jar file.
[root@sandbox ~]# storm jar storm-starter-0.0.1-storm-0.9.0.1.jar storm.starter.WordCountTopology WordCount -c nimbus.host=sandbox.hortonworks.com
Note: For Sandbox versions without Storm preinstalled, navigate to the /usr/lib/storm/bin/ directory to run the command above.
STEP 5: OPEN STORM UI
Let’s open the Storm UI and look at the topology graphically:
You should see the WordCount topology listed in the Topology summary.
STEP 6: CLICK ON WORDCOUNT TOPOLOGY
The topology is listed under Topology Summary. Click on it and you will see the following:
STEP 7: NAVIGATE TO BOLT SECTION
Click on count.
STEP 8: NAVIGATE TO EXECUTOR SECTION
Click on any port and you will be able to view the results.
You just processed streaming data using Apache Storm. Congratulations on completing the Tutorial!
APPENDIX A: VIEW STORM LOG FILES
Last but not least, you can always look at the log files. These logs are extremely useful for debugging or checking status. They are located in the following directory:
[root@sandbox ~]# cd /var/log/storm
[root@sandbox storm]# ls -ltr
APPENDIX B: INSTALL MAVEN AND GET STARTED WITH STORM STARTER KIT
INSTALL MAVEN
Download and install Apache Maven as shown in the commands below:
curl -o /etc/yum.repos.d/epel-apache-maven.repo https://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo
yum -y install apache-maven
mvn -version
GET STARTED WITH STORM STARTER KIT
Download the Storm Starter Kit and try other topology examples, such as ExclamationTopology and ReachTopology.
git clone https://github.com/apache/storm.git && cd storm/examples/storm-starter