What is Structured Streaming?
We've covered discretized streams in quite a lot of detail. However, if you have been following Spark news recently, you may have heard of Structured Streaming, the new DataFrame/Dataset-based streaming framework. We've talked about how revolutionary Spark Streaming with DStreams was, and how you can combine multiple engines such as SQL, Streaming, Graph, and ML to build a data pipeline, so why the need for a new engine altogether?
Based on its experience with Spark Streaming, the Apache Spark team realized that there were a few issues with DStreams. The top three issues were as follows:
- As we have seen in the preceding examples, DStreams work with the batch time (when a record arrives at the stream), but not with the event time recorded inside the data.
- While every effort was made to keep the API similar, the Streaming API was still different from the RDD API in the sense that you cannot take a batch job and start running it as...