Joining streaming data with streaming data in Apache Spark Structured Streaming and Delta Lake
In Apache Spark Structured Streaming, a stream-to-stream join combines two or more streaming DataFrames or Datasets on a common key. The operation merges ongoing, real-time data streams so that correlated events can be analyzed continuously as they arrive. The result is a new streaming DataFrame that evolves over time as new data lands in the input streams, enabling real-time processing and analytics.
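To make this concrete, here is a minimal PySpark sketch of a stream-to-stream join between two hypothetical Delta tables of ad impressions and clicks. The paths, column names (impressionAdId, clickAdId, impressionTime, clickTime), and watermark and interval values are illustrative assumptions, not the recipe's actual code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# Assumes a session already configured with the Delta Lake extensions.
spark = SparkSession.builder.appName("stream_stream_join_sketch").getOrCreate()

# Two hypothetical Delta tables of ad impressions and clicks
# (paths and column names are illustrative only).
impressions = spark.readStream.format("delta").load("/data/delta/impressions")
clicks = spark.readStream.format("delta").load("/data/delta/clicks")

# Watermarks tell Spark how long to wait for late, out-of-order events
# before dropping them and clearing the corresponding join state.
impressions_wm = impressions.withWatermark("impressionTime", "2 hours")
clicks_wm = clicks.withWatermark("clickTime", "3 hours")

# Inner join on the shared ad ID plus an event-time range condition, so the
# engine can bound how much state it buffers for each side.
joined = impressions_wm.join(
    clicks_wm,
    expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 hour
    """),
)

# Write the continuously updated join result to another Delta table.
query = (
    joined.writeStream
    .format("delta")
    .option("checkpointLocation", "/data/checkpoints/impression_clicks")
    .outputMode("append")
    .start("/data/delta/impression_clicks")
)
```

With an inner join like this, results can be written in append mode; the watermarks and the event-time range condition together tell the engine when a buffered row can no longer find a match and its state can be discarded.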
In this recipe, you will learn how to join two streams of data using Apache Spark Structured Streaming and Delta Lake. You will also learn how to handle late-arriving and out-of-order data, and how to update the join results as new data arrives. Here is a diagram that shows how the two streams are joined in this recipe:
Figure 5.14 – Stream-to-stream joins in Structured Streaming