Stream processing
Streaming is a very useful mode of data processing, but it can come with a great deal of complexity. One thing a purist must consider is that Spark doesn't do true "streaming": Spark does micro-batch processing. It loads whatever new messages have arrived and runs a batch process over them in a continuous loop, checking for new data between batches. A pure stream processing engine such as Apache Flink instead processes each record as it arrives. As a simple example, suppose there are 100 new messages in a Kafka topic: Spark would process all 100 of them in one micro-batch, while Flink would process each message individually.
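To make the micro-batch model concrete, here is a minimal sketch in PySpark that reads from Kafka and writes each batch to the console. The broker address and the topic name ("events") are assumptions for illustration, and the Kafka source needs the spark-sql-kafka connector package on the classpath; the key point is the trigger, which controls how often Spark gathers everything new into one micro-batch.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()

# Subscribe to a Kafka topic; "events" and the broker address are
# placeholder values for this sketch.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .load())

# Every 10 seconds, Spark collects ALL messages that arrived since the
# previous batch and processes them together as one micro-batch.
query = (raw.selectExpr("CAST(value AS STRING) AS message")
         .writeStream
         .format("console")
         .trigger(processingTime="10 seconds")
         .start())

query.awaitTermination()

If 100 messages arrive during the interval, this job processes all 100 in a single batch; a Flink job subscribed to the same topic would handle them one record at a time.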
Spark Structured Streaming is a DataFrame API on top of the older Spark Streaming (DStream) API, much as the DataFrame API sits on top of the RDD API. Streaming DataFrames go through the same Catalyst optimizer as normal DataFrames, so I suggest always using Structured Streaming over the legacy Spark Streaming API. Also, Spark...
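Because a streaming DataFrame is still a DataFrame, the usual transformations apply unchanged. The following sketch uses Spark's built-in rate source, which generates rows locally and therefore needs no Kafka, to show an ordinary groupBy aggregation running on a stream; the bucket column is invented for this example.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows locally,
# which makes it handy for trying Structured Streaming without Kafka.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Ordinary DataFrame operations; Catalyst optimizes this streaming
# query the same way it would a static one.
counts = (stream
          .withColumn("bucket", F.col("value") % 10)
          .groupBy("bucket")
          .count())

query = (counts.writeStream
         .outputMode("complete")   # streaming aggregations need complete/update mode
         .format("console")
         .start())

query.awaitTermination()

Note that the aggregation logic is identical to what you would write for a static DataFrame; only the readStream/writeStream plumbing differs.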