Streaming KMeans for a real-time on-line classifier
In this recipe, we explore the streaming version of KMeans in Spark, an unsupervised learning algorithm. The purpose of the streaming KMeans algorithm is to group a set of data points into a number of clusters based on their similarity.
Spark provides two implementations of the KMeans clustering method: one for static/offline data and another for continuously arriving, real-time data.
We will cluster the iris dataset as new data streams into our streaming context.
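To make the idea of updating clusters as data arrives more concrete, here is a minimal sketch of a single streaming update step in plain Scala, with no Spark dependency. It follows the update rule described in the Spark MLlib documentation for streaming k-means, where each center is blended with the mean of the newly assigned points and a decay factor discounts older data. The names `Cluster`, `nearest`, and `updateBatch` are hypothetical and exist only for this illustration, and the handling of clusters that receive no points in a batch is an assumption, not a statement about Spark's internals.

```scala
// Sketch of one streaming k-means update step (illustration only, not Spark code).
object StreamingKMeansSketch {
  type Point = Array[Double]

  // A cluster keeps its center and the (discounted) weight of points seen so far.
  final case class Cluster(center: Point, weight: Double)

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the cluster whose center is closest to p.
  def nearest(clusters: Vector[Cluster], p: Point): Int =
    clusters.indices.minBy(i => squaredDistance(clusters(i).center, p))

  // Assign each point of the batch to its nearest center, then move each center
  // towards the points it received:
  //   newCenter = (center * weight * decay + sum of batch points) / (weight * decay + batch count)
  def updateBatch(clusters: Vector[Cluster], batch: Seq[Point], decay: Double): Vector[Cluster] = {
    val assigned: Map[Int, Seq[Point]] = batch.groupBy(p => nearest(clusters, p))
    clusters.zipWithIndex.map { case (c, i) =>
      val oldWeight = c.weight * decay
      assigned.get(i) match {
        // Assumption: a cluster with no new points keeps its center and only has its weight discounted.
        case None => c.copy(weight = oldWeight)
        case Some(points) =>
          val batchSum = points.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
          val newWeight = oldWeight + points.size
          val newCenter = c.center.zip(batchSum).map { case (ct, s) => (ct * oldWeight + s) / newWeight }
          Cluster(newCenter, newWeight)
      }
    }
  }
}
```

With a decay factor of 1.0 every point ever seen counts equally, while a decay factor of 0.0 makes each center depend only on the most recent batch. For example, a center at (0, 0) with weight 1 that receives the single point (2, 2) under a decay of 1.0 moves to (1, 1).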
How to do it...
- Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter13
- Import the necessary packages:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import scala.collection...