Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream
In the previous recipe, we published all the tweets stored in Elasticsearch to a Kafka topic. In this recipe, we'll subscribe to that Kafka stream and train a classification model from it. Later, we'll use this trained model to classify a live Twitter stream.
How to do it...
This is a really small recipe that is composed of three steps:
Subscribing to a Kafka stream: There are two ways to subscribe to a Kafka stream, and we'll be using the createDirectStream method, which is faster. Just like Twitter streaming, Spark has first-class support for subscribing to a Kafka stream. This is achieved by adding the spark-streaming-kafka dependency. Let's add it to our build.sbt file:

"org.apache.spark" %% "spark-streaming-kafka" % sparkVersion
The subscription process is more or less the reverse of the publishing process, even in terms of the properties that we pass to Kafka:

val topics = Set("twtopic")
val kafkaParams...
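To make the subscription-and-training step concrete, here is a minimal sketch of how it might fit together. The broker address (localhost:9092), the message format (a numeric label, a tab, then the tweet text), and the feature dimension are illustrative assumptions, not taken from the recipe itself:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object KafkaTrainingApp extends App {
  val conf = new SparkConf().setAppName("KafkaTrainingApp").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(10))

  // Subscription properties mirror those used on the publishing side
  val topics = Set("twtopic")
  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // assumed broker address

  // Receiver-less, direct subscription to the Kafka topic
  val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

  // Hypothetical message format: "<label>\t<tweet text>"
  val numFeatures = 5000
  val hashingTF = new HashingTF(numFeatures)
  val trainingData = messages.map { case (_, line) =>
    val Array(label, text) = line.split("\t", 2)
    LabeledPoint(label.toDouble, hashingTF.transform(text.split(" ").toSeq))
  }

  // Continuously update the logistic regression model as batches arrive
  val model = new StreamingLogisticRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(numFeatures))
  model.trainOn(trainingData)

  ssc.start()
  ssc.awaitTermination()
}
```

The model updates its weights on every micro-batch, so the same model instance can later be handed the live Twitter stream for prediction with predictOn.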