Introducing k-means clustering
In the previous section, you learned that unsupervised machine learning algorithms are used to extract key structural or information content from large, possibly complex datasets. These algorithms do so with little or no manual input and function without the need for training data (sets of labeled explanatory and response variables needed to train an algorithm to recognize the desired classification boundaries). This means that unsupervised algorithms are effective tools for generating information about the structure and content of new or unfamiliar datasets. They allow the analyst to build a strong understanding of a dataset in a fraction of the time that manual inspection would require.
Clustering is probably the archetypal unsupervised learning technique, for several reasons:
- A lot of development time has been sunk into optimizing clustering algorithms, with efficient implementations available in most data science languages, including Python.
- Clustering algorithms tend to be very fast, with smoothed analysis showing polynomial running time for common implementations. This makes it straightforward to run multiple clustering configurations, even over large datasets. Scalable implementations also exist that parallelize the algorithm to run over TB-scale datasets.
- Clustering algorithms are frequently easy to understand, and their operation is thus easy to explain if necessary.
The most popular clustering algorithm is k-means; this algorithm forms k-many clusters by first randomly initializing the clusters as k-many points in the data space. Each of these points is the mean of a cluster. An iterative process then occurs, running as follows:
- Each point is assigned to the cluster that minimizes the within-cluster sum of squares, which is intuitively the cluster with the nearest mean.
- The center (centroid) of each cluster becomes the new mean. This causes each of the means to shift.
Over enough iterations, the centroids move into positions that minimize a performance metric (most commonly the within-cluster sum of squares). Once this measure is minimized, observations are no longer reassigned during iteration; at this point, the algorithm has converged on a solution.
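To make the loop concrete, here is a minimal NumPy sketch of the algorithm. It is an illustration only (for simplicity, it does not handle empty clusters); in practice we'll use scikit-learn's implementation:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initialize the k means as randomly chosen observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest mean.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each cluster's centroid becomes the new mean.
        # (Note: an empty cluster would produce a NaN mean; not handled here.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable; the algorithm has converged
        centroids = new_centroids
    return centroids, labels
```

Calling kmeans(data, 10) over the scaled digits data, for example, would return ten centroids and a cluster label per observation.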
Kick-starting clustering analysis
Now that we've reviewed the clustering algorithm, let's run through the code and see what clustering can do for us:
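A sketch of how such a script might look (the helper name bench_k_means and the print formatting here are illustrative, built on scikit-learn's digits dataset and metrics module; the scoring functions themselves are discussed below):

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

# Load the digits dataset and standardize each feature.
digits = load_digits()
data = scale(digits.data)
n_digits = len(np.unique(digits.target))
labels = digits.target

def bench_k_means(estimator, name, data):
    """Fit the estimator and print a row of clustering quality scores."""
    estimator.fit(data)
    print('%-12s homogeneity: %.3f  completeness: %.3f  v-measure: %.3f  '
          'ARI: %.3f  silhouette: %.3f'
          % (name,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean')))
```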
Note
One critical difference between this code and the PCA code we saw previously is that this code begins by applying a scale function to the digits dataset. This function standardizes each feature to zero mean and unit variance. It's critically important to scale data wherever needed, whether on a log scale or a bounded scale, so as to prevent the magnitude of different feature values from having disproportionately powerful effects on the analysis. Whether the data needs scaling at all (and what kind of scaling is needed, within which range, and so on) is very much tied to the shape and nature of the data. If the distribution of the data shows outliers or variation within a large range, it may be appropriate to apply log scaling. Whether this is done manually through visualization and exploratory analysis techniques or through the use of summary statistics, decisions around scaling are tied to the data under inspection and the analysis techniques to be used. A further discussion of scaling decisions and considerations may be found in Chapter 7, Feature Engineering Part II.
Helpfully, scikit-learn uses the k-means++ algorithm by default, which improves over the original k-means algorithm in terms of both running time and success rate in avoiding poor clusterings.
The algorithm achieves this by running an initialization procedure to find cluster centroids that approximate minimal variance within classes.
You may have spotted from the preceding code that we're using a set of performance estimators to track how well our k-means application is performing. It isn't practical to measure the performance of a clustering algorithm based on a single correctness percentage or using the same performance measures that are commonly used with other algorithms. The definition of success for clustering algorithms is that they provide an interpretation of how input data is grouped that trades off between several factors, including class separation, in-group similarity, and cross-group difference.
The homogeneity score is a simple, zero-to-one-bounded measure of the degree to which clusters contain only assignments of a given class. A score of one indicates that all clusters contain measurements from a single class. This measure is complemented by the completeness score, which is a similarly bounded measure of the extent to which all members of a given class are assigned to the same cluster. As such, a completeness score and homogeneity score of one indicate a perfect clustering solution.
The validity measure (v-measure) is a harmonic mean of the homogeneity and completeness scores, exactly analogous to the F-measure for binary classification. In essence, it provides a single, 0-1-scaled value that monitors both homogeneity and completeness.
The Adjusted Rand Index (ARI) is a similarity measure that tracks the consensus between sets of assignments. As applied to clustering, it measures the consensus between the true, pre-existing observation labels and the labels predicted as an output of the clustering algorithm. The underlying Rand index measures labeling similarity on a 0-1 scale, with a score of one indicating perfect agreement between predicted and true labels.
The main challenge with all of the preceding performance measures, as well as other similar measures (for example, the adjusted mutual information score), is that they require knowledge of the ground truth; that is, they require some or all of the data under inspection to be labeled. If labels do not exist and cannot be generated, these measures won't work. In practice, this is a substantial drawback, as very few datasets come prelabeled and the creation of labels can be time-consuming.
One option for measuring the performance of a k-means clustering solution without labeled data is the Silhouette Coefficient. This is a measure of how well-defined the clusters within a model are. The Silhouette Coefficient for a given dataset is the mean of the coefficient for each sample, where this coefficient is calculated as follows:

s = (b - a) / max(a, b)
The definitions of each term are as follows:
- a: The mean distance between a sample and all other points in the same cluster
- b: The mean distance between a sample and all other points in the next nearest cluster
This score is bounded between -1 and 1, with -1 indicating incorrect clustering, 1 indicating very dense clustering, and scores around 0 indicating overlapping clusters. This tends to fit our expectations of how a good clustering solution is composed.
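In scikit-learn, this is a single call; a quick sketch, assuming the data array from the listing above:

```python
from sklearn import metrics
from sklearn.cluster import KMeans

# Fit a model and compute the mean Silhouette Coefficient over all samples.
model = KMeans(n_clusters=10, n_init=10).fit(data)
print(metrics.silhouette_score(data, model.labels_, metric='euclidean'))
```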
In the case of the digits dataset, we can employ all of the performance measures described here. As such, we'll complete the preceding example by initializing our bench_k_means function over the digits dataset:
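A sketch of what that call might look like, continuing from the earlier listing (the name argument is purely a label for the printed row):

```python
from sklearn.cluster import KMeans

# Benchmark k-means++ with one cluster per digit class.
bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
              name='k-means++', data=data)
```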
This yields a row of scores for the clustering run (note that the random seed means your results will vary from mine!). Let's take a look at these results in more detail.

The Silhouette score at 0.123 is fairly low, but not surprisingly so, given that the handwritten digits data is inherently noisy and does tend to overlap. However, some of the other scores are not that impressive. The V-measure at 0.619 is reasonable, but in this case is held back by a poor homogeneity measure, suggesting that the cluster centroids did not resolve perfectly. Moreover, the ARI at 0.465 is not great.
Note
Let's put this in context. The worst-case classification attempt, random assignment, would give approximately 10% classification accuracy on this ten-class dataset, and all of our performance measures would be accordingly very low. While we're definitely doing a lot better than that, we're still trailing far behind the best computational classification attempts. As we'll see in Chapter 4, Convolutional Neural Networks, convolutional nets achieve results with extremely low classification errors on handwritten digit datasets. We're unlikely to achieve this level of accuracy with traditional k-means clustering!
All in all, it's reasonable to think that we could do better.
To give this another try, we'll add a preprocessing stage: PCA, the technique we previously walked through, to reduce the dimensionality of our input dataset. The code to achieve this is very simple, as follows:
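A sketch of the likely shape of that code, reusing data, n_digits, and bench_k_means from the earlier listing:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Reduce the scaled digits data to as many components as there are classes.
pca = PCA(n_components=n_digits).fit(data)
reduced_data = pca.transform(data)

# Cluster in the reduced space and report the same quality metrics.
bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
              name='PCA-reduced', data=reduced_data)
```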
This code simply applies PCA to the digits dataset, yielding as many principal components as there are classes (in this case, digits). It can be sensible to review the output of PCA before proceeding, as the presence of any very small principal components may suggest a dataset that contains collinearity or otherwise merits further inspection.
This instance of clustering shows noticeable improvement. The V-measure and ARI have increased by approximately 0.08 points, with the V-measure reading a fairly respectable 0.693. The Silhouette Coefficient did not change significantly. Given the complexity and interclass overlap within the digits dataset, these are good results, particularly stemming from such a simple code addition!
Inspection of the digits dataset with clusters superimposed shows that some meaningful clusters appear to have been formed. It is also apparent from such a plot that actually detecting the character from the input feature vectors may be a challenging task.
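One way to produce a plot of this kind is sketched below; for display purposes, this sketch both projects and clusters in two dimensions, which is a simplification of clustering in the full reduced space:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Project to two dimensions purely for visualization.
points_2d = PCA(n_components=2).fit_transform(data)
model = KMeans(init='k-means++', n_clusters=n_digits, n_init=10).fit(points_2d)

# Color each observation by its assigned cluster and mark the centroids.
plt.scatter(points_2d[:, 0], points_2d[:, 1], c=model.labels_,
            cmap='tab10', s=10)
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
            marker='x', c='black', s=80)
plt.title('k-means clusters over PCA-reduced digits data')
plt.show()
```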
Tuning your clustering configurations
The previous examples described how to apply k-means, walked through relevant code, showed how to plot the results of a clustering analysis, and identified appropriate performance metrics. However, when applying k-means to real-world datasets, there are some extra precautions that need to be taken, which we will discuss.
One critical practical point is how to select an appropriate value for k. Initializing k-means clustering with a specific k value is harmless if you already know roughly how many clusters to expect, but in many cases it is not clear at the outset how many clusters you might find or what values of k may be helpful.
We can rerun the preceding code for multiple values of k in a batch and look at the performance metrics, but this won't tell us which instance of k most effectively captures structure within the data. The risk is that as k increases, the Silhouette Coefficient or unexplained variance may decrease dramatically without meaningful clusters being formed. The extreme case would be k = n, where n is the number of observations in the sample: every point would form its own cluster, the Silhouette Coefficient would be low, and the results would be meaningless. There are, however, many less extreme cases in which overfitting may occur due to an overly high k value.
To mitigate this risk, it's advisable to use supporting techniques to motivate a selection of k. One useful technique in this context is the elbow method. The elbow method is a very simple technique: for each instance of k, plot the percentage of explained variance against k. This typically produces a plot that looks like a bent arm.
For the PCA-reduced dataset, this code looks like the following snippet:
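A sketch of one common implementation, assuming the reduced_data array from the PCA step (here the within-cluster sum of squares, exposed by scikit-learn as inertia_, stands in for unexplained variance):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 21)  # candidate values of k: 1 through 20
within_cluster_ss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10).fit(reduced_data)
    # inertia_ is the within-cluster sum of squares: our measure of
    # unexplained variance for this value of k.
    within_cluster_ss.append(model.inertia_)

plt.plot(ks, within_cluster_ss, 'bx-')
plt.xlabel('k')
plt.ylabel('Within-cluster sum of squares (unexplained variance)')
plt.title('Elbow method over the PCA-reduced digits data')
plt.show()
```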
This application of the elbow method takes the PCA reduction from the previous code sample and applies a test of the explained variance (specifically, a test of the variance within clusters). The result is output as a measure of unexplained variance for each value of k in the range specified. In this case, as we're using the digits dataset (which we know to have ten classes), the range specified was 1 to 20.
The elbow method involves selecting the value of k that maximizes explained variance while minimizing k; that is, the value of k at the crook of the elbow. The technical sense underlying this is that a minimal gain in explained variance at greater values of k is offset by the increasing risk of overfitting.
Elbow plots may be more or less pronounced and the elbow may not always be clearly identifiable. This example shows a more gradual progression than may be observable in other cases with other datasets. It's worth noting that, while we know the number of classes within the dataset to be ten, the elbow method starts to show diminishing returns on k increases almost immediately and the elbow is located at around five classes. This has a lot to do with the substantial overlap between classes, which we saw in previous plots. While there are ten classes, it becomes increasingly difficult to clearly identify more than five or so.
With this in mind, it's worth noting that the elbow method is intended for use as a heuristic rather than as some kind of objective principle. The use of PCA as a preprocessing step to improve clustering performance also tends to smooth the graph, delivering a more gradual curve than otherwise.
In addition to making use of the elbow method, it can be valuable to look at the clusters themselves, as we did earlier in the chapter, using PCA to reduce the dimensionality of the data. By plotting the dataset and projecting cluster assignation onto the data, it is sometimes very obvious when a k-means implementation has fitted to a local minimum or has overfit the data. Rerunning our previous k-means clustering over the digits dataset with an artificially high value such as k = 150 demonstrates extreme overfitting: some clusters contain a single observation, and there's really no way that this output would generalize well to other samples.
Plotting the elbow function or cluster assignments is quick to achieve and straightforward to interpret. However, we've spoken of these techniques in terms of being heuristics. Even if a dataset contains a definite number of underlying classes, we may not be sure that a heuristic method will deliver generalizable results.
Another drawback is that visual plot checking is a very manual technique, which makes it poorly suited for production environments or automation. In such circumstances, it's ideal to find a code-based, automatable method. One solid option in this case is v-fold cross-validation, a widely used validation technique.
Cross-validation is simple to undertake. To make it work, one splits the dataset into v parts; one part is set aside as a test set, and the model is trained against the remaining parts. This process is repeated so that each part serves as the test set in turn, and the resulting scores are compared. Let's try this now, again using the digits dataset:
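A sketch of how this might look with scikit-learn's model_selection tools (the split counts and sizes follow the description below):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)

# Ten randomized splits, each using 60% of samples for training, 40% for testing.
cv = ShuffleSplit(n_splits=10, test_size=0.4, random_state=0)
model = KMeans(init='k-means++', n_clusters=10, n_init=10)

# Score each fold's cluster assignments against the true labels via the ARI.
scores = cross_val_score(model, data, digits.target,
                         cv=cv, scoring='adjusted_rand_score')
print(scores)
```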
This code performs some now familiar data loading and preparation and initializes the k-means clustering algorithm. It then defines cv, the cross-validation parameters. This includes specification of the number of splits to perform (called n_iter in older versions of scikit-learn and n_splits in current ones) and the amount of data that should be used in each fold. In this case, we're using 60% of the data samples as training data and 40% as test data. We then apply the k-means model and cv parameters that we've specified within the cross-validation scoring function and print the results as scores. Let's take a look at these scores now.
The output gives us, in order, the adjusted Rand score for cross-validated k-means++ clustering performed across each of the 10 folds. We can see that the results fluctuate between around 0.4 and 0.55; the earlier ARI score for k-means++ without PCA fell within this range (at 0.465). What we've created, then, is code that we can incorporate into our analysis in order to check the quality of our clustering automatically on an ongoing basis.
As noted earlier in this chapter, your choice of success measure is contingent on what information you already have. In most cases, you won't have access to ground truth labels from a dataset and will be obliged to use a measure such as the Silhouette Coefficient that we discussed previously.
Note
Sometimes, even using both cross-validation and visualizations won't provide a conclusive result. Especially with unfamiliar datasets, it's not unheard of to run into issues where some noise or secondary signal resolves better at a different k value than the signal you're attempting to analyze.
As with every other algorithm discussed in this book, it is imperative to understand the dataset one wishes to work with. Without this insight, it's entirely possible for even a technically correct and rigorous analysis to deliver inappropriate conclusions. Chapter 6, Text Feature Engineering, will discuss principles and techniques for the inspection and preparation of unfamiliar datasets more thoroughly.