[box type="note" align="" class="" width=""]In this article by Shilpi Saxena and Saurabh Gupta, from their book Practical Real-Time Data Processing and Analytics, we explore Storm's architecture and its components and configure Storm to run in a cluster.[/box]
Initially, real-time processing was implemented by pushing messages into a queue and then reading the messages from it using Python or any other language to process them one by one. The primary challenges with this approach were:
Below are the two main reasons that make Storm a highly reliable real-time engine:
The following are the primary characteristics of Storm that make it special and ideal for real-time processing.
Reliable: The spout keeps track of each tuple it emits and replays the tuple if any failure occurs downstream.
Unreliable: The spout does not track a tuple once it has been emitted as a stream to a bolt or another spout, so failed tuples are never replayed.
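To make the difference concrete, below is a minimal sketch, not taken from the book, of a spout written against the Storm 1.x Java API. Emitting a tuple together with a message ID makes it trackable (reliable mode), so Storm calls ack() or fail() back on the spout; emitting without a message ID opts out of tracking (unreliable mode). The SentenceSpout class name and the hard-coded sentence are illustrative assumptions:

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    // Tuples awaiting acknowledgement, keyed by message ID, so they can be replayed
    private final Map<UUID, String> pending = new ConcurrentHashMap<>();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String sentence = "storm processes streams in real time"; // stand-in for a real source
        UUID msgId = UUID.randomUUID();
        pending.put(msgId, sentence);
        collector.emit(new Values(sentence), msgId);   // reliable: tuple is tracked via msgId
        // collector.emit(new Values(sentence));       // unreliable: fire and forget
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);                         // fully processed, forget it
    }

    @Override
    public void fail(Object msgId) {
        collector.emit(new Values(pending.get(msgId)), msgId); // replay the failed tuple
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}

With the first emit, a failed or timed-out tuple is re-emitted from fail(); with the commented-out variant, Storm never calls ack() or fail() for that tuple.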
Before setting up Storm, we need to set up Zookeeper, which Storm requires. Below are instructions on how to install, configure, and run Zookeeper in standalone and cluster modes.
Download Zookeeper from http://www-eu.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz. After the download, extract zookeeper-3.4.6.tar.gz as follows:
tar -xvf zookeeper-3.4.6.tar.gz
The following files and folders will be extracted:
There are two types of deployment with Zookeeper: standalone and cluster. There is no big difference in configuration, just a few extra parameters for cluster mode.
As shown in the previous figure, go to the conf folder and change the zoo.cfg file as follows:
# Length of a single tick in milliseconds; used to regulate heartbeats
# and timeouts.
tickTime=2000
# Number of ticks to allow followers to connect and sync with the leader.
initLimit=5
# Number of ticks a follower may take to sync with the leader before it
# is dropped.
syncLimit=2
# Directory where Zookeeper stores snapshots and, by default,
# transaction logs.
dataDir=/tmp/zookeeper/tmp
# Port on which Zookeeper listens for client connections.
clientPort=2182
# Maximum number of concurrent connections a single client (IP) may make
# to this Zookeeper node.
maxClientCnxns=30
In addition to the above configuration, add the following entries for cluster mode:
server.1=zkp-1:2888:3888
server.2=zkp-2:2888:3888
server.3=zkp-3:2888:3888
server.x=[hostname]:nnnn:mmmm : Here x is the ID assigned to each Zookeeper node; it must be unique across the cluster. In the dataDir configured above, create a file named "myid" and put the corresponding ID of the Zookeeper node in it; the same ID is used as x here. nnnn is the port used by followers to connect to the leader node and mmmm is the port used for leader election.
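For example, on the host zkp-1 (defined as server.1 above), and assuming the dataDir value shown earlier, the myid file could be created as follows:
echo 1 > /tmp/zookeeper/tmp/myid
Repeat this on zkp-2 and zkp-3 with the IDs 2 and 3, respectively.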
Use the following command to run Zookeeper from the Zookeeper home directory:
bin/zkServer.sh start
The console returns after the following message, and the process keeps running in the background:
Starting zookeeper ... STARTED
The following command can be used to check the status of the Zookeeper process:
bin/zkServer.sh status
In standalone mode, the output will be as follows:
Mode: standalone
In cluster mode, the output will be one of the following:
Mode: follower # in case of follower node
Mode: leader # in case of leader node
Below are instructions on how to install, configure, and run Storm with Nimbus and supervisors.
Download Storm from http://www.apache.org/dyn/closer.lua/storm/apache-storm-1.0.3/apache-storm-1.0.3.tar.gz. After the download, extract apache-storm-1.0.3.tar.gz, as follows:
tar -xvf apache-storm-1.0.3.tar.gz
Below are the files and folders that will be extracted:
Configuring
As shown in the previous figure, go to the conf folder and add/edit the following properties in storm.yaml:
storm.zookeeper.servers:
- "zkp-1"
- "zkp-2"
- "zkp-3"
storm.zookeeper.port: 2182
nimbus.host: "nimbus"
storm.local.dir: "/usr/local/storm/tmp"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
- 6704
- 6705
worker.childopts: "-Xmx1024m"
nimbus.childopts: "-XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70"
supervisor.childopts: "-Xmx1024m"
topology.message.timeout.secs: 60
topology.debug: false
There are four services needed to start a complete Storm cluster; run the following commands from the Storm home directory:
bin/storm nimbus
bin/storm supervisor
bin/storm ui
You can access the UI at http://nimbus-host:8080, as shown in the following figure.
bin/storm logviewer
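Once all four services are running, one way to verify the cluster end to end is to submit a small topology. The following is a minimal sketch, not from the book; it reuses the illustrative SentenceSpout shown earlier together with a trivial printing bolt, and the class and topology names are assumptions:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class ClusterSmokeTest {

    // Trivial bolt that just prints what it receives; BaseBasicBolt acks automatically.
    public static class PrinterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println(input.getStringByField("sentence"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // emits nothing downstream
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new PrinterBoltFriendlySpout(), 1); // replace with the SentenceSpout sketched earlier
        builder.setBolt("printer", new PrinterBolt(), 2)
               .shuffleGrouping("sentences");

        Config conf = new Config();
        conf.setNumWorkers(2); // occupies two of the supervisor.slots.ports configured above

        StormSubmitter.submitTopology("smoke-test", conf, builder.createTopology());
    }
}

Package it into a jar and submit it with bin/storm jar <your-jar> ClusterSmokeTest (with the spout line pointing at the SentenceSpout class from the earlier sketch); the topology, its workers, and its executors should then appear in the Storm UI at http://nimbus-host:8080.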
We started with the history of Storm, where we discussed how Nathan Marz got the idea for Storm and the challenges he faced while releasing it as open source software and later moving it to Apache. We discussed the architecture of Storm and its components: Nimbus, supervisors, workers, executors, and tasks make up Storm's architecture, while its programming components are the tuple, stream, topology, spout, and bolt. We also discussed how to set up Storm and configure it to run in a cluster, with Zookeeper set up first, as Storm requires it.
The above was an excerpt from the book Practical Real-Time Data Processing and Analytics.