A clustering algorithm is one that identifies groups of data points according to their proximity to each other. These algorithms are similar to classification algorithms in that they also partition a dataset into subsets of similar points. But in classification, we already have data whose classes have been identified, such as sweet fruit. In clustering, we seek to discover the unknown groups themselves.
Of the several clustering algorithms that we will examine in this article, hierarchical clustering is probably the simplest. The trade-off is that it works well only with small datasets in Euclidean space.
The general setup is that we have a dataset S of m points in Rn, which we want to partition into a given number k of clusters C1, C2, ..., Ck, where within each cluster the points are relatively close together.
Here is the algorithm:
The centroid of a cluster is the point whose coordinates are the averages of the corresponding coordinates of the cluster points. For example, the centroid of the cluster C = {(2, 4), (3, 5), (6, 6), (9, 1)} is the point (5, 4), because (2 + 3 + 6 + 9)/4 = 5 and (4 + 5 + 6 + 1)/4 = 4. This is illustrated in the figure below:
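To make this concrete, here is a minimal Java sketch of the centroid computation together with the usual agglomerative merge loop: start with each point in its own cluster and repeatedly merge the two clusters whose centroids are closest, until only k clusters remain. The centroid-distance merge rule and all class and method names are illustrative assumptions, not the book's listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A sketch of centroid computation plus a centroid-linkage agglomerative loop.
public class HierarchicalClusteringSketch {

    // Centroid: the coordinate-wise average of the cluster's points.
    static double[] centroid(List<double[]> cluster) {
        int n = cluster.get(0).length;
        double[] c = new double[n];
        for (double[] p : cluster) {
            for (int j = 0; j < n; j++) c[j] += p[j];
        }
        for (int j = 0; j < n; j++) c[j] /= cluster.size();
        return c;
    }

    static double squaredDistance(double[] p, double[] q) {
        double d = 0;
        for (int j = 0; j < p.length; j++) d += (p[j] - q[j]) * (p[j] - q[j]);
        return d;
    }

    static List<List<double[]>> cluster(List<double[]> points, int k) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] p : points) clusters.add(new ArrayList<>(List.of(p)));
        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double bestD = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {         // find the two
                for (int b = a + 1; b < clusters.size(); b++) { // closest clusters
                    double d = squaredDistance(centroid(clusters.get(a)),
                                               centroid(clusters.get(b)));
                    if (d < bestD) { bestD = d; bestA = a; bestB = b; }
                }
            }
            clusters.get(bestA).addAll(clusters.remove(bestB)); // merge them
        }
        return clusters;
    }

    public static void main(String[] args) {
        // The centroid example from the text: C = {(2,4), (3,5), (6,6), (9,1)}.
        List<double[]> c = List.of(new double[]{2, 4}, new double[]{3, 5},
                                   new double[]{6, 6}, new double[]{9, 1});
        System.out.println(Arrays.toString(centroid(c)));  // prints [5.0, 4.0]
    }
}
```

Running main prints [5.0, 4.0], matching the centroid computed above.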
A popular alternative to hierarchical clustering is the K-means algorithm. It is related to the K-Nearest Neighbor (KNN) classification algorithm. As with hierarchical clustering, the K-means clustering algorithm requires the number of clusters, k, as input.
Here is the algorithm:
It also requires k points, one for each cluster, to initialize the algorithm. These initial points can be selected at random or by some a priori method. One approach is to run hierarchical clustering on a small sample taken from the given dataset and then pick the centroids of the resulting clusters. A version of the algorithm that chooses these initial points more carefully is also known as the K-Means++ algorithm.
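The standard K-means loop alternates two steps: assign each point to its nearest center, then move each center to the centroid of its cluster. Here is a minimal Java sketch of that loop; the squared Euclidean distance, the plain random seeding, and all names are illustrative choices, not taken from the book's listing.

```java
import java.util.Random;

// A sketch of the standard K-means loop: assign each point to its nearest
// center, then move each center to the centroid of its cluster.
public class KMeansSketch {

    public static int[] cluster(double[][] points, int k, int maxIterations) {
        int m = points.length, n = points[0].length;
        Random random = new Random();
        double[][] centers = new double[k][];
        // Random seeding; the hierarchical-sample initialization described
        // above could be plugged in here instead.
        for (int c = 0; c < k; c++) centers[c] = points[random.nextInt(m)].clone();

        int[] label = new int[m];
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: nearest center by (squared) Euclidean distance.
            for (int i = 0; i < m; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[i], centers[c])
                            < squaredDistance(points[i], centers[best])) best = c;
                }
                label[i] = best;
            }
            // Update step: each center becomes the centroid of its cluster.
            double[][] sums = new double[k][n];
            int[] counts = new int[k];
            for (int i = 0; i < m; i++) {
                counts[label[i]]++;
                for (int j = 0; j < n; j++) sums[label[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;   // keep an empty center unchanged
                for (int j = 0; j < n; j++) centers[c][j] = sums[c][j] / counts[c];
            }
        }
        return label;   // label[i] is the cluster index of points[i]
    }

    static double squaredDistance(double[] p, double[] q) {
        double d = 0;
        for (int j = 0; j < p.length; j++) d += (p[j] - q[j]) * (p[j] - q[j]);
        return d;
    }
}
```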
The k-medoids clustering algorithm is similar to the k-means algorithm, except that each cluster center, called its medoid, is one of the data points instead of being the mean of its points. The idea is to minimize the average distance from each medoid to the points in its cluster. The Manhattan metric is usually used for these distances. Since those averages will be minimal if and only if the corresponding sums of distances are, the algorithm reduces to minimizing the sum of all distances from the points to their medoids. This sum is called the cost of the configuration.
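In code, the cost of a configuration is just a sum of Manhattan distances. A small Java sketch, with illustrative names:

```java
// Manhattan distance and the cost of a k-medoids configuration: the sum of
// the distances from each point to the medoid of its cluster.
public class MedoidCost {

    static double manhattan(double[] p, double[] q) {
        double d = 0;
        for (int j = 0; j < p.length; j++) d += Math.abs(p[j] - q[j]);
        return d;
    }

    // label[i] is the cluster index of points[i]; medoids[c] is cluster c's medoid.
    static double cost(double[][] points, double[][] medoids, int[] label) {
        double sum = 0;
        for (int i = 0; i < points.length; i++) {
            sum += manhattan(points[i], medoids[label[i]]);
        }
        return sum;
    }
}
```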
Here is the algorithm:
This is illustrated by the simple example in Figure 8.16, which shows 10 data points in two clusters. The two medoids are shown as filled points. The initial configuration is:
C1 = {(1,1), (2,1), (3,2), (4,2), (2,3)}, with y1 = x1 = (1,1)
C2 = {(4,3), (5,3), (2,4), (4,4), (3,5)}, with y2 = x10 = (3,5)
The sums are
s1 = d(x2, y1) + d(x3, y1) + d(x4, y1) + d(x5, y1) = 1 + 3 + 4 + 3 = 11
s2 = d(x6, y2) + d(x7, y2) + d(x8, y2) + d(x9, y2) = 3 + 4 + 2 + 2 = 11
s = s1 + s2 = 11 + 11 = 22
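As a quick check of that arithmetic, here is a self-contained Java snippet that recomputes both sums for the initial configuration; the array encoding of the clusters is just one convenient choice.

```java
// Recompute s1, s2, and s for the initial configuration of Figure 8.16.
public class InitialConfigurationCheck {

    static double manhattan(double[] p, double[] q) {
        double d = 0;
        for (int j = 0; j < p.length; j++) d += Math.abs(p[j] - q[j]);
        return d;
    }

    public static void main(String[] args) {
        double[][] c1 = {{1, 1}, {2, 1}, {3, 2}, {4, 2}, {2, 3}};  // medoid y1 = (1,1)
        double[][] c2 = {{4, 3}, {5, 3}, {2, 4}, {4, 4}, {3, 5}};  // medoid y2 = (3,5)
        double[] y1 = {1, 1}, y2 = {3, 5};
        double s1 = 0, s2 = 0;
        for (double[] p : c1) s1 += manhattan(p, y1);  // the medoid itself adds 0
        for (double[] p : c2) s2 += manhattan(p, y2);
        System.out.println(s1 + " + " + s2 + " = " + (s1 + s2));  // 11.0 + 11.0 = 22.0
    }
}
```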
In the first part of step 3, the algorithm changes the medoid for C1 to y1 = x3 = (3,2). In the second part of step 3, this causes the clusters to change to:
C1 = {(1,1), (2,1), (3,2), (4,2), (2,3), (4,3), (5,3)}, with y1 = x3 = (3,2)
C2 = {(2,4), (4,4), (3,5)}, with y2 = x10 = (3,5)
This makes the sums:
s1 = 3 + 2 + 1 + 2 + 2 + 3 = 13
s2 = 2 + 2 = 4
s = s1 + s2 = 13 + 4 = 17
The resulting configuration is shown in the second panel of the figure below:
At step 3 of the algorithm, the process repeats for cluster C2. The resulting configuration is shown in the third panel of the above figure. The computations are:
C1 = {(1,1), (2,1), (3,2), (4,2), (4,3), (5,3)}, with y1 = x3 = (3,2)
C2 = {(2,3), (2,4), (4,4), (3,5)}, with y2 = x8 = (2,4)
s = s1 + s2 = (3 + 2 + 1 + 2 + 3) + (1 + 2 + 2) = 11 + 5 = 16
The algorithm continues with two more changes, finally converging to the minimal configuration shown in the fifth panel of the above figure. This version of k-medoids clustering is also called partitioning around medoids (PAM).
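Here is a compact Java sketch in the spirit of the PAM search traced above: tentatively swap each medoid with each non-medoid point, keep any swap that lowers the total cost, and stop when no swap helps. This greedy structure is an illustrative reading, not necessarily the book's exact listing.

```java
// A greedy PAM-style search: try swapping each medoid with each non-medoid
// point, and keep any swap that lowers the total cost.
public class PamSketch {

    static double manhattan(double[] p, double[] q) {
        double d = 0;
        for (int j = 0; j < p.length; j++) d += Math.abs(p[j] - q[j]);
        return d;
    }

    // Total cost: each point contributes its distance to its nearest medoid.
    static double cost(double[][] points, int[] medoids) {
        double sum = 0;
        for (double[] p : points) {
            double best = Double.POSITIVE_INFINITY;
            for (int m : medoids) best = Math.min(best, manhattan(p, points[m]));
            sum += best;
        }
        return sum;
    }

    static boolean isMedoid(int[] medoids, int i) {
        for (int m : medoids) if (m == i) return true;
        return false;
    }

    // medoids holds indices into points (the initial choice, e.g. random).
    static int[] cluster(double[][] points, int[] medoids) {
        medoids = medoids.clone();
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int c = 0; c < medoids.length; c++) {
                for (int i = 0; i < points.length; i++) {
                    if (isMedoid(medoids, i)) continue;   // only non-medoid points
                    double before = cost(points, medoids);
                    int old = medoids[c];
                    medoids[c] = i;                       // tentative swap
                    if (cost(points, medoids) < before) {
                        improved = true;                  // keep the swap
                    } else {
                        medoids[c] = old;                 // undo it
                    }
                }
            }
        }
        return medoids;
    }
}
```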
One disadvantage of each of the clustering algorithms previously presented (hierarchical, k-means, k-medoids) is the requirement that the number of clusters k be determined in advance. The affinity propagation clustering algorithm does not have that requirement. Developed in 2007 by Brendan J. Frey and Delbert Dueck at the University of Toronto, it has become one of the most widely used clustering methods.
Like k-medoids clustering, affinity propagation selects cluster center points, called exemplars, from the dataset to represent the clusters. This is done by message passing between the data points.
The algorithm works with three two-dimensional arrays:
sij = the similarity between xi and xj
rik = responsibility: message from xi to xk on how well-suited xk is as an exemplar for xi
aik = availability: message from xk to xi on how well-suited xk is as an exemplar for xi
Here is the complete algorithm:
1. Initialize the availabilities: aik = 0 for all i and k.
2. Repeat until convergence:
rik = sik − max {aij + sij : j ≠ k}
aik = min {0, rkk + ∑j { max {0, rjk } : j ≠ i ∧ j ≠ k }}, for i ≠ k;
akk = ∑j { max {0, rjk } : j ≠ k }
A point xk will be an exemplar for a point xi if aik + rik = maxj {aij + rij}.
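These message-passing updates translate almost directly into code. Below is a minimal Java sketch, assuming the similarity matrix s, including the diagonal "preference" values s[k][k], is supplied by the caller; the damping of the messages, which the original paper uses to aid convergence, and the fixed iteration count are illustrative choices.

```java
// A sketch of the affinity propagation message-passing loop. The similarity
// matrix s, including the diagonal "preference" values s[k][k], is assumed
// to be supplied by the caller.
public class AffinityPropagationSketch {

    public static int[] cluster(double[][] s, int iterations, double damping) {
        int n = s.length;
        double[][] r = new double[n][n];   // responsibilities
        double[][] a = new double[n][n];   // availabilities (start at zero)

        for (int iter = 0; iter < iterations; iter++) {
            // rik = sik - max{ aij + sij : j != k }, damped for stability.
            for (int i = 0; i < n; i++) {
                for (int k = 0; k < n; k++) {
                    double max = Double.NEGATIVE_INFINITY;
                    for (int j = 0; j < n; j++) {
                        if (j != k) max = Math.max(max, a[i][j] + s[i][j]);
                    }
                    r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - max);
                }
            }
            // aik = min{0, rkk + sum_{j != i,k} max{0, rjk}} for i != k,
            // akk = sum_{j != k} max{0, rjk}.
            for (int k = 0; k < n; k++) {
                double sumPos = 0;
                for (int j = 0; j < n; j++) {
                    if (j != k) sumPos += Math.max(0, r[j][k]);
                }
                for (int i = 0; i < n; i++) {
                    double aNew = (i == k)
                            ? sumPos
                            : Math.min(0, r[k][k] + sumPos - Math.max(0, r[i][k]));
                    a[i][k] = damping * a[i][k] + (1 - damping) * aNew;
                }
            }
        }
        // xk is the exemplar for xi when aik + rik is maximal over k.
        int[] exemplar = new int[n];
        for (int i = 0; i < n; i++) {
            int best = 0;
            for (int k = 1; k < n; k++) {
                if (a[i][k] + r[i][k] > a[i][best] + r[i][best]) best = k;
            }
            exemplar[i] = best;
        }
        return exemplar;   // exemplar[i] is the index of xi's exemplar
    }
}
```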
If you enjoyed this excerpt from the book Java Data Analysis by John R. Hubbard, check out the book to learn how to implement various machine learning algorithms, data visualization and more in Java.