Clustering with K-Means
This example uses the same test data as the previous example, but attempts to find clusters in that data using the MLlib K-Means algorithm.
Theory on Clustering
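The K-Means algorithm iteratively attempts to determine clusters within the test data by minimizing the distance between each cluster's center vector (the mean of its current members) and the candidate member vectors assigned to that cluster. The following equation assumes dataset members that range from X1 to Xn; it also assumes K cluster sets that range from S1 to Sk, where K <= n.

$$\underset{S}{\arg\min} \sum_{i=1}^{K} \sum_{X_j \in S_i} \lVert X_j - \mu_i \rVert^2$$

Here \mu_i denotes the mean of the members of S_i, so the sum being minimized is the total squared distance between each member and the center of its cluster.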
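The assign-and-update iteration described above can be sketched in plain Scala, without Spark. This is only an illustration of the idea, not MLlib's implementation: the object name KMeansSketch, its helper functions, the fixed iteration count, and the use of squared Euclidean distance are all assumptions made for this sketch.

```scala
object KMeansSketch {
  // A point is a fixed-length vector of doubles (hypothetical alias for this sketch).
  type Point = Vector[Double]

  // Squared Euclidean distance between two vectors of equal length.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to point p.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / points.size).toVector

  // Repeats the two K-Means steps a fixed number of times:
  // 1. assign every point to its nearest center,
  // 2. move every center to the mean of the points assigned to it.
  // A center that attracts no points keeps its previous position.
  def run(data: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(initial) { (centers, _) =>
      val groups = data.groupBy(p => closest(p, centers))
      centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }

  def main(args: Array[String]): Unit = {
    // Two visibly separate groups of two-dimensional points.
    val data = Seq(
      Vector(1.0, 1.0), Vector(1.0, 2.0),
      Vector(9.0, 9.0), Vector(10.0, 9.0)
    )
    // Seed the centers with one point from each group.
    val centers = run(data, initial = Seq(data(0), data(2)), iterations = 5)
    centers.foreach(c => println(c.mkString(",")))
  }
}
```

Each pass can only keep or reduce the total squared distance between members and their cluster centers, which is why repeating the two steps converges. The result depends on the initial centers, so a poor starting choice can settle on a worse clustering.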
K-Means in Practice
The MLlib K-Means functionality processes its data as numeric vectors (MLlib Vector values, rather than the LabeledPoint structure used for classification), and so it needs numeric input data. As the same data from the last section is being reused, we will not explain the data conversion again. The only data-related change in this section is that processing in HDFS now takes place under the /data/spark/kmeans/ directory. Additionally, the conversion Scala script for the K-Means...