Configuring checkpoints for Structured Streaming in Apache Spark
In this recipe, we will learn how to configure checkpoints for stateful streaming queries in Apache Spark. Checkpoints are a mechanism to ensure the fault tolerance and reliability of streaming applications by saving the intermediate state of the query to a durable storage system. Checkpoints can also help recover from failures and resume the query from where it left off.
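To make this concrete, here is a minimal sketch of how a checkpoint location is typically supplied to a Structured Streaming query through the writeStream options. The broker address, topic name, and checkpoint path are placeholder assumptions for a local setup and will differ in your environment; in production the checkpoint path should point at durable storage such as HDFS or cloud object storage.

```python
from pyspark.sql import SparkSession

# Assumes the spark-sql-kafka connector package is available on the classpath.
spark = (
    SparkSession.builder
    .appName("checkpoint-example")
    .getOrCreate()
)

# Read a stream from an assumed local Kafka broker and a topic named "users".
users_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "users")
    .load()
)

# The checkpointLocation option tells Spark where to persist offsets and
# query state so that a restarted query can resume from where it left off.
query = (
    users_stream
    .writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/users_query")
    .start()
)
```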
Getting ready
Before we start, we need to make sure that we have a Kafka cluster running and a topic carrying some streaming data. For simplicity, we will use a single-node Kafka cluster and a topic named users. Open the 4.0 user-gen-kafka.ipynb notebook and execute the cell. This notebook produces a user record every few seconds and puts it on a Kafka topic called users.
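The notebook's code is not reproduced here; as a rough illustration of what such a generator does, the following sketch publishes a synthetic user record to the users topic every few seconds using the kafka-python library. The broker address, record fields, and interval are assumptions and may not match the notebook exactly.

```python
import json
import random
import time
import uuid

from kafka import KafkaProducer  # pip install kafka-python

# Assumed single-node Kafka broker running locally.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Emit one synthetic user record every few seconds; the field names
# are illustrative rather than those used by the actual notebook.
while True:
    user = {
        "user_id": str(uuid.uuid4()),
        "name": random.choice(["alice", "bob", "carol", "dave"]),
        "age": random.randint(18, 65),
        "ts": int(time.time()),
    }
    producer.send("users", value=user)
    producer.flush()
    print(f"Produced: {user}")
    time.sleep(5)
```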
Make sure you have run this notebook and that it is producing records as shown here:
Figure 4.9 – Output from user generation script
...