Streaming KMeans to classify data in near real-time
Spark Streaming is a powerful facility that lets you combine near real-time and batch processing in the same paradigm. The streaming KMeans interface lives at the intersection of MLlib clustering and Spark Streaming, and takes full advantage of the core facilities provided by Spark Streaming itself (for example, fault tolerance, exactly-once delivery semantics, and so on).
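To make the moving parts concrete before walking through the recipe, here is a minimal sketch of the pattern it relies on: a StreamingKMeans model is trained continuously on one DStream while scoring another. The directory paths, cluster count, dimensionality, and batch interval below are illustrative assumptions, not the recipe's actual values.

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKMeansSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("StreamingKMeansSketch")
    val ssc  = new StreamingContext(conf, Seconds(2))   // 2-second micro-batches (illustrative)

    // Training lines are dense vectors such as [1.0,2.0,3.0]
    val trainingStream = ssc.textFileStream("/tmp/kmeans/train").map(Vectors.parse)
    // Test lines are labeled points such as (1.0,[1.0,2.0,3.0])
    val testStream = ssc.textFileStream("/tmp/kmeans/test").map(LabeledPoint.parse)

    val model = new StreamingKMeans()
      .setK(3)                   // number of clusters (illustrative)
      .setDecayFactor(1.0)       // 1.0 = weight all past data equally
      .setRandomCenters(3, 0.0)  // 3-dimensional random initial centers with zero weight

    model.trainOn(trainingStream)   // update cluster centers on every training micro-batch
    model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Dropping new files into the training directory updates the cluster centers on the fly, while files arriving in the test directory are assigned to the nearest current center.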
How to do it...
- Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter14
- Import the necessary packages for streaming KMeans:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
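As a hedged illustration of how these imports come together in the setup steps that follow (the application name and batch interval here are assumptions, not the recipe's values), the typical wiring looks like this:

// Quiet the logs so the streaming output stays readable
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)

// SparkSession drives the batch/SQL side of the application
val spark = SparkSession.builder
  .master("local[*]")
  .appName("KMeansStreaming")   // illustrative application name
  .getOrCreate()

// StreamingContext wraps the same SparkContext with a 1-second batch interval
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))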
We set up the following parameters...