A clustering algorithm is one that identifies groups of data points according to their proximity to each other. These algorithms are similar to classification algorithms in that they also partition a dataset into subsets of similar points. But in classification, we already have data whose classes have been identified, such as sweet fruit. In clustering, we seek to discover the unknown groups themselves.
Of the several clustering algorithms that we will examine in this article, hierarchical clustering is probably the simplest. The trade-off is that it works well only with small datasets in Euclidean space.
The general setup is that we have a dataset S of m points in Rn, which we want to partition into a given number k of clusters C1, C2, ..., Ck, where within each cluster the points are relatively close together.
Here is the algorithm:
The centroid of a cluster is the point whose coordinates are the averages of the corresponding coordinates of the cluster points. For example, the centroid of the cluster C = {(2, 4), (3, 5), (6, 6), (9, 1)} is the point (5, 4), because (2 + 3 + 6 + 9)/4 = 5 and (4 + 5 + 6 + 1)/4 = 4. This is illustrated in the figure below:
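To make this concrete, here is a minimal Java sketch of the centroid computation together with the usual agglomerative merge loop: start with each point in its own cluster and repeatedly merge the two clusters whose centroids are closest, until only k clusters remain. The centroid-distance merge rule and all class and method names are illustrative assumptions, not the book's listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A sketch of centroid computation plus a centroid-linkage agglomerative loop.
public class HierarchicalClusteringSketch {

    // Centroid: the coordinate-wise average of the cluster's points.
    static double[] centroid(List<double[]> cluster) {
        int n = cluster.get(0).length;
        double[] c = new double[n];
        for (double[] p : cluster) {
            for (int j = 0; j < n; j++) c[j] += p[j];
        }
        for (int j = 0; j < n; j++) c[j] /= cluster.size();
        return c;
    }

    static double squaredDistance(double[] p, double[] q) {
        double d = 0;
        for (int j = 0; j < p.length; j++) d += (p[j] - q[j]) * (p[j] - q[j]);
        return d;
    }

    static List<List<double[]>> cluster(List<double[]> points, int k) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] p : points) clusters.add(new ArrayList<>(List.of(p)));
        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double bestD = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {         // find the two
                for (int b = a + 1; b < clusters.size(); b++) { // closest clusters
                    double d = squaredDistance(centroid(clusters.get(a)),
                                               centroid(clusters.get(b)));
                    if (d < bestD) { bestD = d; bestA = a; bestB = b; }
                }
            }
            clusters.get(bestA).addAll(clusters.remove(bestB)); // merge them
        }
        return clusters;
    }

    public static void main(String[] args) {
        // The centroid example from the text: C = {(2,4), (3,5), (6,6), (9,1)}.
        List<double[]> c = List.of(new double[]{2, 4}, new double[]{3, 5},
                                   new double[]{6, 6}, new double[]{9, 1});
        System.out.println(Arrays.toString(centroid(c)));  // prints [5.0, 4.0]
    }
}
```

Running main prints [5.0, 4.0], matching the centroid computed above.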
A popular alternative to hierarchical clustering is the K-means algorithm. It is related to the K-Nearest Neighbor (KNN) classification algorithm. As with hierarchical clustering, the K-means clustering algorithm requires the number of clusters, k, as input.
Here is the algorithm:
It also requires k points, one for each cluster, to initialize the algorithm. These initial points can be selected at random or by some a priori method. One approach is to run hierarchical clustering on a small sample taken from the given dataset and then pick the centroids of the resulting clusters. A version of the algorithm that chooses these initial points more carefully is also known as the K-Means++ algorithm.
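The standard K-means loop alternates two steps: assign each point to its nearest center, then move each center to the centroid of its cluster. Here is a minimal Java sketch of that loop; the squared Euclidean distance, the plain random seeding, and all names are illustrative choices, not taken from the book's listing.

```java
import java.util.Random;

// A sketch of the standard K-means loop: assign each point to its nearest
// center, then move each center to the centroid of its cluster.
public class KMeansSketch {

    public static int[] cluster(double[][] points, int k, int maxIterations) {
        int m = points.length, n = points[0].length;
        Random random = new Random();
        double[][] centers = new double[k][];
        // Random seeding; the hierarchical-sample initialization described
        // above could be plugged in here instead.
        for (int c = 0; c < k; c++) centers[c] = points[random.nextInt(m)].clone();

        int[] label = new int[m];
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: nearest center by (squared) Euclidean distance.
            for (int i = 0; i < m; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[i], centers[c])
                            < squaredDistance(points[i], centers[best])) best = c;
                }
                label[i] = best;
            }
            // Update step: each center becomes the centroid of its cluster.
            double[][] sums = new double[k][n];
            int[] counts = new int[k];
            for (int i = 0; i < m; i++) {
                counts[label[i]]++;
                for (int j = 0; j < n; j++) sums[label[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;   // keep an empty center unchanged
                for (int j = 0; j < n; j++) centers[c][j] = sums[c][j] / counts[c];
            }
        }
        return label;   // label[i] is the cluster index of points[i]
    }

    static double squaredDistance(double[] p, double[] q) {
        double d = 0;
        for (int j = 0; j < p.length; j++) d += (p[j] - q[j]) * (p[j] - q[j]);
        return d;
    }
}
```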
The k-medoids clustering algorithm is similar to the k-means algorithm, except that each cluster center, called its medoid, is one of the data points instead of being the mean of its points. The idea is to minimize the average distance from each medoid to the points in its cluster. The Manhattan metric is usually used for these distances. Since those averages will be minimal if and only if the corresponding sums of distances are, the algorithm reduces to minimizing the sum of all distances from the points to their medoids. This sum is called the cost of the configuration.
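In code, the cost of a configuration is just a sum of Manhattan distances. A small Java sketch, with illustrative names:

```java
// Manhattan distance and the cost of a k-medoids configuration: the sum of
// the distances from each point to the medoid of its cluster.
public class MedoidCost {

    static double manhattan(double[] p, double[] q) {
        double d = 0;
        for (int j = 0; j < p.length; j++) d += Math.abs(p[j] - q[j]);
        return d;
    }

    // label[i] is the cluster index of points[i]; medoids[c] is cluster c's medoid.
    static double cost(double[][] points, double[][] medoids, int[] label) {
        double sum = 0;
        for (int i = 0; i < points.length; i++) {
            sum += manhattan(points[i], medoids[label[i]]);
        }
        return sum;
    }
}
```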
Here is the algorithm:
This is illustrated by the simple example in Figure 8.16, which shows 10 data points in two clusters. The two medoids are shown as filled points. The initial configuration is:
C1 = {(1,1), (2,1), (3,2), (4,2), (2,3)}, with y1 = x1 = (1,1)
C2 = {(4,3), (5,3), (2,4), (4,4), (3,5)}, with y2 = x10 = (3,5)
The sums are
s1 = d(x2, y1) + d(x3, y1) + d(x4, y1) + d(x5, y1) = 1 + 3 + 4 + 3 = 11
s2 = d(x6, y2) + d(x7, y2) + d(x8, y2) + d(x9, y2) = 3 + 4 + 2 + 2 = 11
s = s1 + s2 = 11 + 11 = 22
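As a quick check of that arithmetic, here is a self-contained Java snippet that recomputes both sums for the initial configuration; the array encoding of the clusters is just one convenient choice.

```java
// Recompute s1, s2, and s for the initial configuration of Figure 8.16.
public class InitialConfigurationCheck {

    static double manhattan(double[] p, double[] q) {
        double d = 0;
        for (int j = 0; j < p.length; j++) d += Math.abs(p[j] - q[j]);
        return d;
    }

    public static void main(String[] args) {
        double[][] c1 = {{1, 1}, {2, 1}, {3, 2}, {4, 2}, {2, 3}};  // medoid y1 = (1,1)
        double[][] c2 = {{4, 3}, {5, 3}, {2, 4}, {4, 4}, {3, 5}};  // medoid y2 = (3,5)
        double[] y1 = {1, 1}, y2 = {3, 5};
        double s1 = 0, s2 = 0;
        for (double[] p : c1) s1 += manhattan(p, y1);  // the medoid itself adds 0
        for (double[] p : c2) s2 += manhattan(p, y2);
        System.out.println(s1 + " + " + s2 + " = " + (s1 + s2));  // 11.0 + 11.0 = 22.0
    }
}
```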
In the first part of step 3, the algorithm changes the medoid for C1 to y1 = x3 = (3,2). In the second part of step 3, this causes the clusters to change to:
C1 = {(1,1), (2,1), (3,2), (4,2), (2,3), (4,3), (5,3)}, with y1 = x3 = (3,2)
C2 = {(2,4), (4,4), (3,5)}, with y2 = x10 = (3,5)
This makes the sums:
s1 = 3 + 2 + 1 + 2 + 2 + 3 = 13
s2 = 2 + 2 = 4
s = s1 + s2 = 13 + 4 = 17
The resulting configuration is shown in the second panel of the figure below:
At step 3 of the algorithm, the process repeats for cluster C2. The resulting configuration is shown in the third panel of the above figure. The computations are:
C1 = {(1,1), (2,1), (3,2), (4,2), (4,3), (5,3)}, with y1 = x3 = (3,2)
C2 = {(2,3), (2,4), (4,4), (3,5)}, with y2 = x8 = (2,4)
s = s1 + s2 = (3 + 2 + 1 + 2 + 3) + (1 + 2 + 2) = 11 + 5 = 16
The algorithm continues with two more changes, finally converging to the minimal configuration shown in the fifth panel of the above figure. This version of k-medoids clustering is also called partitioning around medoids (PAM).
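Here is a compact Java sketch in the spirit of the PAM search traced above: tentatively swap each medoid with each non-medoid point, keep any swap that lowers the total cost, and stop when no swap helps. This greedy structure is an illustrative reading, not necessarily the book's exact listing.

```java
// A greedy PAM-style search: try swapping each medoid with each non-medoid
// point, and keep any swap that lowers the total cost.
public class PamSketch {

    static double manhattan(double[] p, double[] q) {
        double d = 0;
        for (int j = 0; j < p.length; j++) d += Math.abs(p[j] - q[j]);
        return d;
    }

    // Total cost: each point contributes its distance to its nearest medoid.
    static double cost(double[][] points, int[] medoids) {
        double sum = 0;
        for (double[] p : points) {
            double best = Double.POSITIVE_INFINITY;
            for (int m : medoids) best = Math.min(best, manhattan(p, points[m]));
            sum += best;
        }
        return sum;
    }

    static boolean isMedoid(int[] medoids, int i) {
        for (int m : medoids) if (m == i) return true;
        return false;
    }

    // medoids holds indices into points (the initial choice, e.g. random).
    static int[] cluster(double[][] points, int[] medoids) {
        medoids = medoids.clone();
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int c = 0; c < medoids.length; c++) {
                for (int i = 0; i < points.length; i++) {
                    if (isMedoid(medoids, i)) continue;   // only non-medoid points
                    double before = cost(points, medoids);
                    int old = medoids[c];
                    medoids[c] = i;                       // tentative swap
                    if (cost(points, medoids) < before) {
                        improved = true;                  // keep the swap
                    } else {
                        medoids[c] = old;                 // undo it
                    }
                }
            }
        }
        return medoids;
    }
}
```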
One disadvantage of each of the clustering algorithms previously presented (hierarchical, k-means, k-medoids) is the requirement that the number of clusters k be determined in advance. The affinity propagation clustering algorithm does not have that requirement. Developed in 2007 by Brendan J. Frey and Delbert Dueck at the University of Toronto, it has become one of the most widely used clustering methods.
Like k-medoids clustering, affinity propagation selects cluster center points, called exemplars, from the dataset to represent the clusters. This is done by message passing between the data points.
The algorithm works with three two-dimensional arrays:
sij = the similarity between xi and xj
rik = responsibility: message from xi to xk on how well-suited xk is as an exemplar for xi
aik = availability: message from xk to xi on how well-suited xk is as an exemplar for xi
Here is the complete algorithm:
1. Initialize the availabilities: aik = 0 for all i and k.
2. Repeat until convergence:
rik = sik − max {aij + sij : j ≠ k}
aik = min {0, rkk + ∑j { max {0, rjk } : j ≠ i ∧ j ≠ k }}, for i ≠ k;
akk = ∑j { max {0, rjk } : j ≠ k }
A point xk will be an exemplar for a point xi if aik + rik = maxj {aij + rij}.
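These message-passing updates translate almost directly into code. Below is a minimal Java sketch, assuming the similarity matrix s, including the diagonal "preference" values s[k][k], is supplied by the caller; the damping of the messages, which the original paper uses to aid convergence, and the fixed iteration count are illustrative choices.

```java
// A sketch of the affinity propagation message-passing loop. The similarity
// matrix s, including the diagonal "preference" values s[k][k], is assumed
// to be supplied by the caller.
public class AffinityPropagationSketch {

    public static int[] cluster(double[][] s, int iterations, double damping) {
        int n = s.length;
        double[][] r = new double[n][n];   // responsibilities
        double[][] a = new double[n][n];   // availabilities (start at zero)

        for (int iter = 0; iter < iterations; iter++) {
            // rik = sik - max{ aij + sij : j != k }, damped for stability.
            for (int i = 0; i < n; i++) {
                for (int k = 0; k < n; k++) {
                    double max = Double.NEGATIVE_INFINITY;
                    for (int j = 0; j < n; j++) {
                        if (j != k) max = Math.max(max, a[i][j] + s[i][j]);
                    }
                    r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - max);
                }
            }
            // aik = min{0, rkk + sum_{j != i,k} max{0, rjk}} for i != k,
            // akk = sum_{j != k} max{0, rjk}.
            for (int k = 0; k < n; k++) {
                double sumPos = 0;
                for (int j = 0; j < n; j++) {
                    if (j != k) sumPos += Math.max(0, r[j][k]);
                }
                for (int i = 0; i < n; i++) {
                    double aNew = (i == k)
                            ? sumPos
                            : Math.min(0, r[k][k] + sumPos - Math.max(0, r[i][k]));
                    a[i][k] = damping * a[i][k] + (1 - damping) * aNew;
                }
            }
        }
        // xk is the exemplar for xi when aik + rik is maximal over k.
        int[] exemplar = new int[n];
        for (int i = 0; i < n; i++) {
            int best = 0;
            for (int k = 1; k < n; k++) {
                if (a[i][k] + r[i][k] > a[i][best] + r[i][best]) best = k;
            }
            exemplar[i] = best;
        }
        return exemplar;   // exemplar[i] is the index of xi's exemplar
    }
}
```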
If you enjoyed this excerpt from the book Java Data Analysis by John R. Hubbard, check out the book to learn how to implement various machine learning algorithms, data visualization and more in Java.