Real-time data processing with Kafka and Spark
A core part of any real-time data pipeline is the processing itself. As data is generated continuously from sources such as user activity logs and IoT sensors, we need to transform these streams of data as they arrive.
Apache Spark’s Structured Streaming module provides a high-level API for processing real-time data streams. It builds on top of Spark SQL, allowing expressive stream processing with SQL-like operations. Structured Streaming uses a micro-batch processing model: incoming streaming data is collected into small batches that are processed at short intervals, achieving end-to-end latencies as low as roughly 100 milliseconds. This keeps processing latency low while retaining the scalability and fault-tolerance guarantees of batch processing.
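To make the micro-batch model concrete, here is a minimal sketch of a Structured Streaming job that reads from a Kafka topic and writes each micro-batch to the console. The topic name (`user_events`) and the broker address (`localhost:9092`) are assumptions for illustration, not taken from the pipeline built earlier; the `parse_event` helper shows the kind of per-record transformation you might later register as a UDF.

```python
import json


def parse_event(raw: str) -> dict:
    """Parse one JSON-encoded event; returns {} for malformed input.

    In a real job this logic could be registered as a Spark UDF and
    applied to the stream's value column.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {}


def main() -> None:
    # pyspark is imported lazily so the helper above can be used
    # without a Spark installation.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("kafka-stream-processing")
        .getOrCreate()
    )

    # Each Kafka record arrives with binary key/value columns;
    # cast the value to a string for downstream parsing.
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
        .option("subscribe", "user_events")                   # assumed topic
        .load()
        .selectExpr("CAST(value AS STRING) AS raw")
    )

    # Each micro-batch is appended to the console as it is processed.
    query = (
        events.writeStream
        .format("console")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()


if __name__ == "__main__":
    main()
```

Running this requires a Spark installation with the Kafka connector package (`spark-sql-kafka`) on the classpath, typically supplied via `spark-submit --packages`.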
We will pick up the real-time pipeline we started with Kafka and build real-time processing on top of it. We will use the Spark...