K-Means clustering
K-Means clustering is a simple and fast clustering algorithm that has been widely adopted in many problem domains. In this chapter, we explain the K-Means algorithm in detail, because it provides the basis for other algorithms. K-Means clustering assigns data points to k clusters, each represented by a cluster centroid, by minimizing the distance from each data point to the centroid of its cluster.
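The quantity being minimized can be written down explicitly. The following is the usual textbook formulation, given here as a sketch; the symbols are assumptions introduced for illustration and do not come from this chapter: x_i is a data point, C_j is the set of points assigned to cluster j, and mu_j is the centroid of that cluster.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

In this formulation, the algorithm alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points, so J never increases from one iteration to the next.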
Let's consider a simple scenario in which we need to cluster people based on their size, using height and weight as the selected attributes. The different colors represent the clusters:
We can plot this problem in two-dimensional space, as shown in the following figure, and solve it using the K-Means algorithm:
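To make the height and weight scenario concrete before turning to Mahout, here is a minimal sketch of sequential K-Means in plain Java. It does not use Apache Mahout and is not Mahout's implementation. The class name KMeansSketch, its method names, the sample measurements, and the choice of initial centroids are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Minimal sequential K-Means sketch (illustrative only, not Mahout's implementation). */
final class KMeansSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs K-Means from the given initial centroids and returns the cluster
     * index of each point. The centroids array is updated in place.
     */
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int k = centroids.length;
        int dim = points[0].length;
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1); // -1 means "not assigned yet"

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centroids);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched cluster
            }

            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Made-up (height in cm, weight in kg) samples, for illustration only.
        double[][] people = {
            {150, 50}, {152, 53}, {155, 55},
            {180, 85}, {183, 88}, {185, 90}
        };
        // Assumed initialization: one sample from each visible group, k = 2.
        double[][] centroids = {people[0].clone(), people[3].clone()};
        int[] labels = cluster(people, centroids, 10);
        System.out.println(Arrays.toString(labels)); // prints [0, 0, 0, 1, 1, 1]
    }
}
```

The sketch takes the initial centroids as an argument rather than choosing them at random, which keeps the result reproducible. In practice the outcome of K-Means depends on that initial choice, and attributes measured on different scales (centimeters versus kilograms here) are often normalized first.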
Getting your hands dirty!
Let's move on to a real implementation of the K-Means algorithm using Apache Mahout. The following are the different ways in which you can run algorithms in Apache Mahout:
Sequential
MapReduce
You can execute the algorithms using a command line (by calling the correct bin...