Reliability of data processing
One of the unique selling points (USPs) of Storm is its guaranteed message processing, which makes it a very attractive solution. Having said that, as programmers we have to make a deliberate modeling decision about whether or not to use the reliability Storm provides, because tracking every tuple comes at a cost.
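To make this concrete, here is a minimal sketch of the spout side of that contract, assuming a hypothetical OrderSpout and the org.apache.storm API. Emitting a tuple together with a message ID opts the tuple into Storm's tracking; Storm then calls ack() or fail() on the spout once the tuple's tree completes or times out. Emitting without a message ID opts out.

```java
import java.util.Map;
import java.util.UUID;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Hypothetical spout illustrating the opt-in nature of Storm's reliability.
public class OrderSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String order = fetchNextOrder(); // placeholder for a real source read
        if (order == null) {
            return;
        }
        // Emitting WITH a message ID asks Storm to track this tuple's whole
        // tree; ack() or fail() below will eventually be called for this ID.
        collector.emit(new Values(order), UUID.randomUUID().toString());
        // Emitting WITHOUT a message ID would opt this tuple out of tracking:
        // collector.emit(new Values(order));
    }

    @Override
    public void ack(Object msgId) {
        // Entire tuple tree processed; safe to discard or commit the message.
    }

    @Override
    public void fail(Object msgId) {
        // Tree failed or timed out; replay the message from the source here.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("order"));
    }

    private String fetchNextOrder() {
        return null; // stub; a real spout would poll a queue or a log
    }
}
```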
First of all, it's very important to understand what happens when a tuple is emitted into the topology and how its corresponding DAG (directed acyclic graph) is constructed. The following diagram captures a typical case:
Here, the function of the topology is clear: every emitted tuple has to be filtered, have calculations applied, and be written to HDFS and to a database. Now, let's examine the implications of this DAG with respect to a single tuple emitted into the topology.
Every single tuple that is emitted into the topology moves as follows:
- Spout A -> Bolt A -> Bolt D -> Database
- Spout A -> Bolt B -> Bolt D -> Database
- Spout A -> Bolt C -> HDFS
So, one tuple from Spout A is replicated at step 1 into three tuples that move to Bolt A, Bolt B, and Bolt C. Storm considers the original tuple fully processed only once every tuple in this tree has been acknowledged.
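With hypothetical component names (and reusing the OrderSpout sketch above), this DAG could be wired up roughly as follows. The key reliability detail is in execute(): each bolt anchors its outgoing tuple to the input and then acks it, which is how the tuple tree from the diagram gets tracked edge by edge.

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Generic stand-in for Bolt A/B/C/D; in a real topology each would do its
// own work (filter, calculate, HDFS write, database write).
class PassThroughBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Anchored emit: passing `input` as the first argument links the new
        // tuple into the tree, so a downstream failure replays from the spout.
        collector.emit(input, new Values(input.getValue(0)));
        // Acking marks this node of the tuple tree as done.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("value"));
    }
}

public class ReliableTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spoutA", new OrderSpout());
        // Step 1: the spout's stream fans out to three bolts.
        builder.setBolt("boltA", new PassThroughBolt()).shuffleGrouping("spoutA");
        builder.setBolt("boltB", new PassThroughBolt()).shuffleGrouping("spoutA");
        builder.setBolt("boltC", new PassThroughBolt()).shuffleGrouping("spoutA");
        // Bolt D consumes the outputs of both Bolt A and Bolt B.
        builder.setBolt("boltD", new PassThroughBolt())
               .shuffleGrouping("boltA")
               .shuffleGrouping("boltB");
        // Submit via LocalCluster or StormSubmitter in a real run.
    }
}
```

If any bolt in the tree fails to ack within the topology's message timeout, Storm calls fail() on the spout with the original message ID, and it is up to the spout to replay the message.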