Kafka
Kafka is a publish-subscribe messaging system that serves as a reliable source for Spark Streaming. The latest Kafka direct API provides a one-to-one mapping between Kafka partitions and the partitions of the RDDs generated by the DStream, along with access to metadata and offsets. Because Kafka is an advanced source as far as Spark Streaming is concerned, one needs to add its dependency to the build of the streaming application. The following artifact should be added in the build tool of one's choice before starting with Kafka integration:
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = 2.1.1
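As an illustration, if the build tool happens to be Maven (an assumption; the text leaves the choice open), these coordinates would typically be declared inside the dependencies section of pom.xml, roughly as follows. The _2.11 suffix in the artifact ID refers to the Scala version the artifact was built against, so it should match the Scala version of the Spark distribution in use.

```xml
<!-- Sketch of a Maven declaration for the artifact listed above.
     Assumes a Maven build; place inside <dependencies> in pom.xml. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
  <version>2.1.1</version>
</dependency>
```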
After adding the dependency, one also needs basic information about the Kafka setup, such as the server(s) on which Kafka is hosted (bootstrap.servers) and some basic configuration describing the messages, such as the key and value deserializers, the group ID, and so on. The following are a few common properties used to describe a Kafka connection:
bootstrap.servers...