Interoperability with streaming platforms (Apache Kafka)
Spark Streaming integrates well with Apache Kafka, currently the most popular messaging platform. This integration has several approaches, and the mechanism has improved over time with regards to performance and reliability.
There are three main approaches:
- Receiver-based approach
- Direct Stream approach
- Structured Streaming
Receiver-based
The first integration between Spark and Kafka is the receiver-based integration. In the receiver-based approach, the driver starts the receivers on the executors, which then pull data using a high-level API from the Kafka brokers. Since the events are being pulled from the Kafka brokers, the receivers update the offsets into Zookeeper, which is also used by the Kafka cluster. The important aspect here is the use of the write ahead log (WAL), which is what the receiver writes to as it collects data from Kafka. If there is a problem and the executors and receivers have to restart or are lost, the WAL can...