Using Kafka with Spark Structured Streaming
Apache Kafka is a distributed streaming platform. It lets you publish and subscribe to streams of data, and process and store them as they are produced. Kafka’s widespread industry adoption for web-scale applications stems from its high throughput, low latency, scalability, high concurrency, reliability, and fault tolerance.
Introducing Kafka concepts
Kafka is typically used to build real-time streaming pipelines that move data reliably between systems, and to transform and react to streams of data. Kafka runs as a cluster on one or more servers.
Some of the key concepts of Kafka are described here:
Topic: High-level abstraction for a category or name to which messages are published. A topic can have zero, one, or many consumers that subscribe to the messages published to it. Users define a new topic for each new category of messages.
Producers: Clients that publish messages to a topic.
Consumers: Clients that consume messages from a topic.
Brokers...