Summary
So why do we have two different streaming engines within the same data processing framework? We hope that after reading this chapter, you'll agree that the main pain points of the classical DStream based engine have been addressed. Formerly, event time-based processing was not possible and only the arrival time of data was considered. Then, late data has simply been processed with the wrong timestamp as only processing time could be used. Also, batch and stream processing required using two different APIs: RDDs and DStreams. Although the API is similar, it is not exactly the same; therefore, the rewriting of code when going back and forth between the two paradigms was necessary. Finally, an end-to-end delivery guarantee was hard to achieve and required lots of user intervention and thinking.
This fault-tolerant end-to-end exactly-once delivery guarantee is achieved through offset tracking and state management in a fault-tolerant Write Ahead Log in conjunction with fault-tolerant sources...