Storm installation

Storm can be installed in either of these two ways:

  1. Fetch a Storm release from its Git repository.
  2. Download a release directly from the following link: https://storm.apache.org/downloads.html
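
As a quick command-line sketch of these two options, the clone URL points at the Apache Storm GitHub repository, and <version> is a placeholder for whichever release you pick from the downloads page:

# Option 1: fetch the source with Git
git clone https://github.com/apache/storm.git

# Option 2: unpack a downloaded release archive; <version> is a placeholder
tar -xzf apache-storm-<version>.tar.gz
cd apache-storm-<version>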

Storm is configured through storm.yaml, which is present in the conf folder.

The following are the configurations for a single-machine Storm cluster installation.

Port 2181 is ZooKeeper's default port. To add more than one ZooKeeper server, list each one on its own line prefixed with a hyphen (-):

storm.zookeeper.servers:
     - "localhost"

# You must change 2181 to another value if ZooKeeper is running on a different port.
storm.zookeeper.port: 2181
# In single-machine mode, Nimbus runs locally, so we keep it as localhost.
# In distributed mode, change localhost to the name of the machine where the Nimbus daemon is running.
nimbus.host: "localhost"
# Here, Storm will generate the logs of the workers, Nimbus, and the Supervisor.
storm.local.dir: "/var/stormtmp"
java.library.path: "/usr/local/lib"
# Allocating 4 ports for workers. More ports can be added.
supervisor.slots.ports:
     - 6700
     - 6701
     - 6702
     - 6703
# Memory allocated to each worker. In the following case, we allocate 768 MB per worker.
worker.childopts: "-Xmx768m"
# Memory for the Nimbus daemon. Here, we give 512 MB to Nimbus.
nimbus.childopts: "-Xmx512m"
# Memory for the Supervisor daemon. Here, we give 256 MB to the Supervisor.
supervisor.childopts: "-Xmx256m"

Note

Notice supervisor.childopts: "-Xmx256m" in the preceding settings. Also note that, under supervisor.slots.ports, we reserved four supervisor ports, which means that a maximum of four worker processes can run on this machine.

storm.local.dir: This directory should be cleaned if there is a problem starting Nimbus or the Supervisor. When running a topology from a local IDE on a Windows machine, C:\Users\<User-Name>\AppData\Local\Temp should be cleaned instead.
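
For example, on Linux the directory configured above can be emptied with a single command. This is only a sketch that assumes the /var/stormtmp location used in this chapter; stop the daemons before cleaning it:

# Assumes storm.local.dir is /var/stormtmp, as configured above
rm -rf /var/stormtmp/*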

Enabling native (Netty only) dependency

Netty enables inter-JVM communication and is very simple to use.

Netty configuration

You don't need to install anything extra for Netty because it is a pure Java-based communication library. All new versions of Storm support Netty.

Add the following lines to your storm.yaml file. Configure and adjust the values to best suit your use case:

storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100

Starting daemons

Storm daemons are the processes that must already be running before you submit your program to the cluster. When you run a topology from a local IDE, these daemons start automatically on predefined ports, but on a cluster, they must be running at all times:

  1. Start the master daemon, Nimbus. Go to the bin directory of the Storm installation and execute the following command (assuming that ZooKeeper is running):
       ./storm nimbus
         Alternatively, to run in the background, use the same command with nohup, like this:
        nohup ./storm nimbus &
  2. Now we have to start the supervisor daemon. Go to the bin directory of the Storm installation and execute this command:
      ./storm supervisor

    To run in the background, use the following command:

             nohup ./storm supervisor &

    Note

    If Nimbus or the Supervisors restart, the running topologies are unaffected as both are stateless.

  3. Let's start the Storm UI. The Storm UI is an optional process that helps us see the Storm statistics of a running topology. You can see how many executors and workers are assigned to a particular topology. The command needed to run the Storm UI is as follows:
           ./storm ui

    Alternatively, to run in the background, use this line with nohup:

           nohup ./storm ui &

    To access the Storm UI, visit http://localhost:8080.

  4. We will now start the Storm logviewer. The logviewer is another optional process, used for viewing the logs from a browser. You can also read the Storm logs from the command line in the $STORM_HOME/logs folder. To start the logviewer, use this command:
             ./storm logviewer

    To run in the background, use the following line with nohup:

             nohup ./storm logviewer &

    Note

    To access Storm's logs, visit http://localhost:8000. The logviewer daemon should run on each machine. Another way to access the log of <machine name> for worker port 6700 is given here:

    <machine name>:8000/log?file=worker-6700.log
  5. DRPC daemon: DRPC is another optional service. DRPC stands for Distributed Remote Procedure Call. You will need the DRPC daemon if you want to supply an argument to a Storm topology externally through a DRPC client. Note that an argument can be supplied only once per call, and the DRPC client may wait for a long time until the Storm topology finishes processing and returns the result. DRPC is not a popular option in projects because, first, it blocks the client and, second, only one argument can be supplied at a time. DRPC is not supported by Python and Petrel. If you do need it, the start commands are sketched below.
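
    If you do choose to use it, the DRPC daemon is started from the bin directory in the same way as the other daemons. The following is only a sketch that follows the pattern of the previous steps; the drpc.servers entry is the storm.yaml setting that tells topologies which DRPC servers to contact:

             ./storm drpc

    To run it in the background, use the following line with nohup:

             nohup ./storm drpc &

    The DRPC server locations are listed in storm.yaml, for example:

             drpc.servers:
                  - "localhost"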

Summarizing, the steps for starting processes are as follows:

  1. First, all the ZooKeeper daemons.
  2. The Nimbus daemon.
  3. The Supervisor daemon on one or more machines.
  4. The UI daemon on the machine where Nimbus is running (optional).
  5. The logviewer daemon (optional).
  6. Submitting the topology.
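
Putting this order together on a single machine, the start-up can be sketched as follows. This assumes a standard ZooKeeper installation (which ships zkServer.sh) and that each command is run from the respective bin directory:

# Start ZooKeeper first (from the ZooKeeper installation's bin directory)
./zkServer.sh start

# Then start the Storm daemons in the background (from Storm's bin directory)
nohup ./storm nimbus &
nohup ./storm supervisor &
nohup ./storm ui &          # optional
nohup ./storm logviewer &   # optional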

You can restart the Nimbus daemon at any time without any impact on the existing processes or topologies. You can also restart the Supervisor daemon and add more Supervisor machines to the Storm cluster at any time.

To submit a jar to the Storm cluster, go to the bin directory of the Storm installation and execute the following command:

./storm jar <path-to-topology-jar> <class-with-the-main> <arg1> … <argN>
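
For instance, a submission could look like the following; the jar path, class name, and argument shown here are purely hypothetical placeholders:

# Hypothetical jar, main class, and argument names, used only for illustration
./storm jar /home/user/mytopology.jar com.example.MyTopology myArg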

Playing with optional configurations

All the previous settings are required to start the cluster, but there are many other settings that are optional and can be tuned based on the topology's requirements. A setting's prefix indicates which part of Storm it configures. The complete list of default YAML configurations is available at https://github.com/apache/storm/blob/master/conf/defaults.yaml.

Configurations can be identified by their prefix. For example, all UI configurations start with ui.*:

Nature of the configuration    Prefix to look into
General                        storm.*
Nimbus                         nimbus.*
UI                             ui.*
Log viewer                     logviewer.*
DRPC                           drpc.*
Supervisor                     supervisor.*
Topology                       topology.*

All of these optional configurations can be added to STORM_HOME/conf/storm.yaml to change any of the default values. All settings that start with topology.* can be set either programmatically from the topology or from storm.yaml. All other settings can be set only from the storm.yaml file. For example, the following shows three different ways to set these parameters; however, all three do the same thing:

/conf/storm.yaml: Changing storm.yaml impacts all the topologies of the cluster. For example:

topology.workers: 1

Topology builder: Changing the configuration through the topology builder while writing code impacts only the current topology. For example (in Python, this value is supplied through the topology code):

conf.setNumWorkers(1);

Custom yaml: Supplying topology.yaml as a command-line option impacts only the current topology. Create topology.yaml with entries similar to storm.yaml and supply it when running the topology. In Python:

petrel submit --config topology.yaml

Any configuration change in storm.yaml affects all running topologies, but when the conf.setXXX option is used in code, each topology can override that option with whatever value suits it best.
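
As a minimal sketch of the third approach, a custom topology.yaml could contain just the worker count shown earlier and then be passed to Petrel on the command line; the file name and value here are only an example:

# topology.yaml (example contents)
topology.workers: 1

# Submit the topology with the custom configuration file
petrel submit --config topology.yaml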
