Idempotent stream writing with Delta Lake and Apache Spark Structured Streaming
In this recipe, you will learn how to perform idempotent stream writing with Delta Lake and Apache Spark Structured Streaming. Idempotent stream writing means that the same batch of data can be written to a Delta table multiple times without changing the final result: a replayed write is recognized and skipped rather than appended again. Combined with a replayable source, this turns at-least-once delivery into effectively exactly-once results, which is useful when you need to deduplicate records, upsert data, or handle failures and retries safely.
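Delta Lake supports this through the txnAppId and txnVersion writer options: a write carrying an application ID and a version that the table has already committed is silently skipped. The following pure-Python sketch (not Spark code; the class and field names are illustrative) mimics those semantics to show why a retried micro-batch does no harm:

```python
class IdempotentSink:
    """Toy sink mimicking Delta Lake's txnAppId/txnVersion idempotency:
    an (app_id, version) pair that has already been committed is skipped,
    so replaying the same micro-batch does not duplicate rows."""

    def __init__(self):
        self.rows = []           # the "table" contents
        self.last_version = {}   # app_id -> highest committed version

    def write(self, app_id, version, batch):
        # Skip the write if this app already committed this version
        # (or a later one) -- this is what makes retries safe.
        if version <= self.last_version.get(app_id, -1):
            return False         # duplicate delivery, ignored
        self.rows.extend(batch)
        self.last_version[app_id] = version
        return True


sink = IdempotentSink()
sink.write("etl-job", 0, [{"user": "alice"}])
sink.write("etl-job", 1, [{"user": "bob"}])
# A retry of batch 1 (e.g. after a driver restart) changes nothing:
sink.write("etl-job", 1, [{"user": "bob"}])
print(len(sink.rows))  # -> 2
```

In real Spark code the version is typically the micro-batch ID passed to a foreachBatch function, so a restarted query that replays a batch cannot double-write it.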
Getting ready
Before we start, we need to make sure that we have a Kafka cluster running and a topic that produces some streaming data. For simplicity, we will use a single-node Kafka cluster and a topic named users. Open the 5.0 user-gen-kafka.ipynb notebook and execute the cell. This notebook produces a user record every few seconds and puts it on the Kafka topic called users.
Make sure you have run this notebook and that it is producing records as shown:
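If you want to see the shape of such a producer, a minimal sketch along these lines is shown below. The field names, the kafka-python client, and the run parameters are all assumptions for illustration; the recipe's actual notebook may differ:

```python
import json
import random
import time

try:
    # kafka-python client; an assumption here -- the notebook may use another library
    from kafka import KafkaProducer
except ImportError:
    KafkaProducer = None


def make_user(user_id):
    """Generate one synthetic user record (field names are illustrative)."""
    return {
        "id": user_id,
        "name": random.choice(["alice", "bob", "carol", "dave"]),
        "ts": int(time.time() * 1000),
    }


def run(bootstrap="localhost:9092", topic="users", count=5, pause_s=2):
    """Send `count` JSON-encoded user records to the topic, one every few seconds."""
    if KafkaProducer is None:
        raise RuntimeError("kafka-python is not installed")
    producer = KafkaProducer(
        bootstrap_servers=bootstrap,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for i in range(count):
        producer.send(topic, make_user(i))
        time.sleep(pause_s)
    producer.flush()


record = make_user(1)
print(record)
```

Calling run() against your single-node broker would populate the users topic with one record every couple of seconds, matching what the notebook is expected to do.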