Clustering using the KMeans algorithm with MLlib
In this recipe, we will demonstrate how you can cluster unlabeled data points using the KMeans algorithm with MLlib. As discussed in the introduction of this chapter, MLlib is the machine learning component of Apache Spark and is a competitive, arguably stronger, alternative to Apache Mahout.
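To make the idea of clustering without labels concrete before turning to MLlib, the following is a minimal plain-Java sketch of the basic k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not MLlib's API or implementation: the class name `KMeansSketch`, its methods, and the naive evenly spaced initialization are hypothetical choices made only for illustration, and the six sample points are assumed to match the contents of the kmeans_data.txt file used in this recipe.

```java
import java.util.Arrays;

// Illustrative only: a tiny k-means (Lloyd's algorithm) in plain Java.
// All names here are hypothetical and unrelated to the MLlib API.
public class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c])
                    < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Runs k-means for at most maxIterations and returns the k centroids.
    // Assumes 1 <= k <= points.length and that all points share one dimension.
    static double[][] fit(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        // Naive deterministic initialization (an assumption of this sketch):
        // pick k input points spread evenly across the array.
        for (int c = 0; c < k; c++) {
            int index = c * (points.length - 1) / Math.max(1, k - 1);
            centroids[c] = points[index].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: add each point to the sum of its nearest cluster.
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each non-empty cluster's centroid to its mean.
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    double updated = sums[c][d] / counts[c];
                    if (updated != centroids[c][d]) {
                        changed = true;
                    }
                    centroids[c][d] = updated;
                }
            }
            if (!changed) {
                break; // converged: no centroid moved
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Sample points, assumed to mirror kmeans_data.txt.
        double[][] points = {
            {0.0, 0.0, 0.0}, {0.1, 0.1, 0.1}, {0.2, 0.2, 0.2},
            {9.0, 9.0, 9.0}, {9.1, 9.1, 9.1}, {9.2, 9.2, 9.2}
        };
        double[][] centroids = fit(points, 2, 20);
        for (double[] centroid : centroids) {
            System.out.println(Arrays.toString(centroid));
        }
    }
}
```

With two clusters, the centroids should settle near (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1), up to floating-point rounding. MLlib performs the same kind of computation in a distributed fashion and with a more careful initialization, so treat this sketch only as a mental model for the recipe that follows.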
Getting ready
You will be using the Maven project you created in the previous recipe (solving simple text mining problems with Apache Spark). If you have not done so yet, follow steps 1-6 in the Getting ready section of that recipe.
Go to https://github.com/apache/spark/blob/master/data/mllib/kmeans_data.txt, download the data, and save it as km-data.txt in the data folder of the project that you created by following the instructions in step 1. Alternatively, you can create a text file named km-data.txt in the data folder of your project and copy and paste the data from the aforementioned URL.
In the package that you created, create a Java class file named KMeansClusteringMlib...