K-Means using Mahout
K-Means is a clustering algorithm that aims to partition n observations into k clusters.
Clustering is a form of unsupervised learning that can be successfully applied to a wide variety of problems. The algorithm is computationally expensive on large datasets, and the open source project Mahout provides distributed implementations of many machine learning algorithms, K-Means among them.
Note
Find more detailed information on K-Means at http://mahout.apache.org/users/clustering/k-means-clustering.html.
The K-Means algorithm assigns each observation to the nearest cluster. Initially, the algorithm is told how many clusters to identify. For each cluster, a random centroid is generated. Samples are partitioned into clusters by minimizing a distance measure between the samples and the centroid of their cluster. Over a number of iterations, the centroids and the assignments of samples to clusters are refined.
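The iteration described above can be sketched on a single machine in a few lines of Python. This is a minimal illustration only, not Mahout's implementation or API: the function names kmeans and euclidean, the iterations and seed parameters, and the sample data are all hypothetical. The sketch also assumes that initial centroids are picked from the samples themselves and that distance is Euclidean.

```python
import random


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def kmeans(samples, k, iterations=10, seed=0):
    """Illustrative K-Means: returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct samples chosen at random.
    centroids = rng.sample(samples, k)
    assignments = None
    for _ in range(iterations):
        # Assignment step: each sample joins the cluster of its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(s, centroids[c]))
            for s in samples
        ]
        if new_assignments == assignments:
            break  # No sample changed cluster, so the result is stable.
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [s for s, a in zip(samples, assignments) if a == c]
            if members:  # An empty cluster keeps its previous centroid.
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Hypothetical data: two well-separated groups of 2-D points.
samples = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
           (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(samples, k=2)
```

With data this cleanly separated, the first three points end up in one cluster and the last three in the other. On real data the result depends on the initial centroids, which is why the number of iterations and the starting points are usually tuned.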
The distance between each sample and a centroid can be measured in a number of ways. Euclidean distance is usually used for samples in numerical...