Merging or applying Change Data Capture with Apache Spark Structured Streaming and Delta Lake
In this recipe, we will use Spark Structured Streaming to read data from a Kafka topic that contains change events and merge them into a Delta Lake table that maintains the latest state of the users.
Change Data Capture (CDC) is a technique for capturing changes made to a data source and applying them to a target system. CDC is commonly used for data replication, data integration, data warehousing, and stream processing.
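The core pattern this recipe builds toward is reading change events from Kafka as a stream and upserting each micro-batch into a Delta table with MERGE. The following is a minimal sketch of that pattern, assuming a broker at localhost:9092, JSON-encoded events with a hypothetical user_id/name/email/op schema, and an existing Delta table at /tmp/delta/users; all of these values are illustrative, not the recipe's actual configuration.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("cdc-merge").getOrCreate()

# Assumed shape of a change event; adjust to match your producer.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("name", StringType()),
    StructField("email", StringType()),
    StructField("op", StringType()),  # e.g. "insert", "update", or "delete"
])

# Read change events from the users topic and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "users")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

def merge_batch(batch_df, batch_id):
    # Upsert each micro-batch into the target Delta table, removing rows
    # whose latest change event is a delete. The table must already exist.
    target = DeltaTable.forPath(spark, "/tmp/delta/users")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.user_id = s.user_id")
        .whenMatchedDelete(condition="s.op = 'delete'")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

query = events.writeStream.foreachBatch(merge_batch).start()
```

Because Structured Streaming's Kafka source does not support MERGE directly, the sketch uses foreachBatch so that each micro-batch can be applied to the Delta table with the batch MERGE API.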
Getting ready
Before we start, we need to make sure that we have a Kafka cluster running and a topic that produces some streaming data. For simplicity, we will use a single-node Kafka cluster and a topic named users. Open the 5.0 user-gen-kafka.ipynb notebook and execute the cell. This notebook produces a user record every few seconds and puts it on the users topic.
Make sure you have run this notebook and that it is producing records as shown:
Figure 5.10 ...
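For orientation, the producer notebook's logic might look roughly like the following sketch, assuming the kafka-python package and a broker at localhost:9092; the record fields and the five-second interval are hypothetical stand-ins, not the notebook's actual code.

```python
import json
import time
import uuid

from kafka import KafkaProducer

# Connect to the assumed local single-node broker and serialize records as JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Emit one synthetic user change event every few seconds.
    user = {
        "user_id": str(uuid.uuid4()),
        "name": "example-user",
        "email": "user@example.com",
        "op": "insert",
    }
    producer.send("users", user)
    producer.flush()
    time.sleep(5)
```

As long as a loop like this is running, the streaming query in the recipe will have a steady supply of change events to merge.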