Spark and Kafka
Spark has a long history of supporting Kafka with both streaming and batch processing. Here, we will go over some of the structured streaming Kafka-related APIs.
Here is a streaming read from a Kafka cluster. It returns a streaming DataFrame:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "<host>:<port>,<host>:<port>") \
    .option("subscribe", "<topic>") \
    .load()
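The key and value columns that Kafka delivers are binary, so a common next step is to cast them to strings before any parsing. A minimal sketch, reusing the df from the snippet above and assuming a running Kafka cluster (the console sink here is an illustrative choice, not part of the original example):

```python
# Cast the binary key/value columns from Kafka into strings for downstream use.
parsed = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Stream the parsed records to the console sink; this only executes
# against a reachable Kafka cluster.
query = parsed.writeStream \
    .format("console") \
    .outputMode("append") \
    .start()
```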
Conversely, if you want a true batch job, you can also read from Kafka in batch mode. Keep in mind that we have already covered techniques for creating a streaming context while processing in a batch style to avoid rereading messages:
df = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "<host>:<port>,<host>:<port>") \
    ...
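In batch mode, the Kafka source also accepts the startingOffsets and endingOffsets options to bound the range of messages a run consumes. A sketch of a complete batch read, assuming a reachable Kafka cluster and a placeholder topic name:

```python
# Batch read of a fixed offset range from Kafka; "earliest" and "latest"
# are the documented defaults for a bounded batch query.
df = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "<host>:<port>,<host>:<port>") \
    .option("subscribe", "<topic>") \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .load()
```

Pinning explicit per-partition offsets (as a JSON string) in place of "earliest"/"latest" is what lets repeated batch runs avoid rereading the same messages.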