Before we dive into Structured Streaming, let's start by talking about DStreams. DStreams are built on top of RDDs and represent a stream of data divided into small chunks. The following figure represents these data chunks as micro-batches spanning milliseconds to seconds. In this example, the DStream of lines is micro-batched into one-second intervals, where each square represents the micro-batch of events that occurred within that one-second window:
- At time interval 1 second, there are five occurrences of the event blue and three occurrences of the event green
- At time interval 2 seconds, there is a single occurrence of the event gohawks
- At time interval 4 seconds, there are two occurrences of the event green
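To make the micro-batching idea concrete, the following is a minimal plain-Python sketch (not Spark code) that groups timestamped events into one-second intervals and counts them per interval. The `micro_batch` helper and the individual timestamps are assumptions made up for illustration; only the per-interval counts come from the figure described above. The sketch also assumes that interval 1 covers the first second, interval 2 the next, and so on.

```python
from collections import Counter, defaultdict

# Hypothetical (timestamp_in_seconds, event_name) pairs. The timestamps are
# illustrative assumptions; only the per-interval counts mirror the figure.
events = [
    (0.1, "blue"), (0.2, "blue"), (0.3, "blue"), (0.4, "blue"), (0.5, "blue"),
    (0.6, "green"), (0.7, "green"), (0.8, "green"),
    (1.5, "gohawks"),
    (3.2, "green"), (3.7, "green"),
]


def micro_batch(events, batch_seconds=1.0):
    """Group events into fixed-length intervals and count each event name.

    Interval n covers the half-open time range
    [(n - 1) * batch_seconds, n * batch_seconds).
    """
    batches = defaultdict(Counter)
    for timestamp, name in events:
        interval = int(timestamp // batch_seconds) + 1
        batches[interval][name] += 1
    return dict(batches)


counts = micro_batch(events)
for interval in sorted(counts):
    print(f"interval {interval}: {dict(counts[interval])}")
```

Intervals that receive no events simply do not appear in the result. In Spark Streaming itself, the grouping is done by the engine according to the batch interval passed to the `StreamingContext`, so application code does not perform this bucketing by hand.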
Because DStreams are built on top of RDDs, Apache Spark's core data abstraction, Spark Streaming integrates easily with other Spark components...