Reading data from real-time sources, such as Apache Kafka, with Apache Spark Structured Streaming
In this recipe, you will learn how to read data from real-time sources, such as Apache Kafka, with Apache Spark Structured Streaming, using the same APIs as when working with batch data. Integrating Spark and Kafka is a powerful combination for real-time data processing: Kafka serves as a highly scalable and fault-tolerant message broker that receives and delivers data streams, which Spark can ingest and analyze as they are generated. Kafka also acts as a buffer, ensuring that data is not lost when Spark experiences processing delays or failures.
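As a preview of the API, here is a minimal sketch of such a streaming read. It assumes a Kafka broker at localhost:9092 and a topic named users (set up in the next section); the Kafka connector package coordinates shown are an assumption and must match your Spark and Scala versions. The full walkthrough follows in the recipe steps.

```python
from pyspark.sql import SparkSession

# Assumed connector version; align it with your Spark/Scala build.
spark = (
    SparkSession.builder
    .appName("kafka-streaming-read")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0",
    )
    .getOrCreate()
)

# readStream mirrors the batch read API; only the source is unbounded.
stream_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "users")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers keys and values as binary; cast them before use.
query = (
    stream_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```

Note how the only difference from a batch read is readStream in place of read; the resulting DataFrame can be transformed with the same operations you already know from batch workloads.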
Getting ready
Before we start, we need to make sure that we have a Kafka cluster running and a topic that produces some streaming data. For simplicity, we will use a single-node Kafka cluster and a topic named users. Open the 4.0 user-gen-kafka.ipynb notebook and...
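The exact contents of that notebook are not reproduced here, but a producer feeding the users topic could look roughly like the following hypothetical sketch. It assumes the kafka-python client is installed and a single broker is reachable at localhost:9092; the event fields are illustrative only.

```python
import json
import random
import time
import uuid

from kafka import KafkaProducer

# Assumed broker address; JSON-encode each event for easy parsing in Spark.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Emit one synthetic user event per second to the "users" topic.
while True:
    event = {
        "user_id": str(uuid.uuid4()),
        "country": random.choice(["US", "DE", "IN", "BR"]),
        "ts": int(time.time() * 1000),
    }
    producer.send("users", value=event)
    time.sleep(1)
```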