Handling out-of-order and late-arriving events with watermarking in Apache Spark Structured Streaming
In this recipe, you will learn how to use watermarking to handle out-of-order and late-arriving events in a streaming application that computes the average temperature of different cities over a sliding window of time. You will use Spark SQL to define the streaming query and the watermark logic. You will also learn how to monitor the progress and performance of your streaming application using the Spark UI.
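Watermarking is a technique that allows Apache Spark Structured Streaming to handle out-of-order and late-arriving events in streaming applications. It lets you declare, in event time, how late data is expected to arrive. Spark tracks the latest event time it has seen and sets the watermark to that time minus the delay threshold you specify. Events that arrive late but within the threshold are still processed, while events older than the watermark may be dropped. Watermarking also allows the engine to free up state and resources by discarding the state of windows that have fallen behind the watermark and can no longer be updated.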
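To make the rule above concrete, here is a minimal plain-Python sketch of the watermark logic. It is an illustration only, not Spark's implementation. The function name split_by_watermark, the helper at, and the sample readings are all hypothetical, and the sketch assumes the watermark moves after every event, whereas Spark advances it between micro-batches.

```python
from datetime import datetime, timedelta


def split_by_watermark(events, delay):
    """Classify events, given in arrival order, as accepted or dropped.

    Simplifying assumption: the watermark is re-evaluated after every
    event. Spark itself advances the watermark between micro-batches.
    """
    max_event_time = None  # latest event time seen so far
    accepted, dropped = [], []
    for event in events:
        event_time = event[0]
        # The watermark trails the latest event time seen so far by `delay`.
        if max_event_time is not None and event_time < max_event_time - delay:
            dropped.append(event)   # older than the watermark: too late
        else:
            accepted.append(event)  # on time, or late but within the threshold
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
    return accepted, dropped


def at(hour, minute):
    """Hypothetical helper: build an event time on an arbitrary sample day."""
    return datetime(2024, 1, 1, hour, minute)


# Hypothetical readings as (event_time, city, temperature), in arrival order.
events = [
    (at(12, 0), "Paris", 20.0),
    (at(12, 5), "Paris", 21.0),
    (at(12, 20), "Paris", 23.0),
    (at(12, 12), "Paris", 22.0),  # out of order; watermark is 12:10, so kept
    (at(12, 3), "Paris", 19.0),   # older than the 12:10 watermark, so dropped
]

accepted, dropped = split_by_watermark(events, timedelta(minutes=10))
print(len(accepted), "accepted;", len(dropped), "dropped")
```

In a real streaming query you do not write this logic yourself. You declare the threshold on the streaming DataFrame with withWatermark (for example, a 10 minute delay on the event-time column), and Spark applies the rule to the windowed aggregation state.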
Getting ready
Before we start, we need to make sure that we have a Kafka...