Evaluating the Performance of Clusters
After applying a clustering algorithm, it is necessary to evaluate how well the algorithm has performed. This is especially important when the clusters are difficult to evaluate visually; for example, when the data has more than two or three features and therefore cannot easily be plotted.
Usually, with supervised algorithms, it is easy to evaluate their performance by simply comparing the prediction of each instance with its true value (class). On the other hand, when dealing with unsupervised models (such as clustering algorithms), it is necessary to pursue other strategies.
In the specific case of clustering algorithms, it is possible to evaluate performance by measuring the similarity of the data points that belong to the same cluster.
Available Metrics in Scikit-Learn
Scikit-learn provides three different scores for evaluating the performance of unsupervised clustering algorithms. The main idea behind these scores is to measure how well defined the clusters' edges are, rather than the dispersion within each cluster. It is also worth mentioning that the scores do not take the size of each cluster into account.
The two most commonly used scores for measuring unsupervised clustering tasks are explained as follows:
- The Silhouette Coefficient Score calculates the mean distance between each point and all the other points in its own cluster (a), as well as the mean distance between each point and all the points in its nearest neighboring cluster (b). It relates the two according to the following equation:
s = (b - a) / max(a,b)
The result is a value between -1 and 1. The lower the value, the worse the performance of the algorithm, and values close to 0 imply overlapping clusters. It is also important to clarify that this score does not work very well when using density-based algorithms such as DBSCAN.
- The Calinski–Harabasz Index measures the relationship between the variance within each cluster and the variance between clusters. More specifically, the variance within a cluster is the mean squared distance of each point with respect to the centroid of that cluster, while the variance between clusters reflects how spread out the clusters are from one another.
The higher the value of the Calinski–Harabasz Index, the better defined and separated the clusters are. There is no universally accepted cut-off value, so this index is used comparatively: among the algorithms being evaluated, the one with the highest value is considered to perform best. As with the Silhouette Coefficient, this score does not perform well on density-based algorithms such as DBSCAN. A minimal example of computing both scores is sketched right after this list.
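As a quick illustration of how the two scores are obtained in practice, the following minimal sketch clusters a small synthetic dataset with k-means and prints both metrics. The dataset created with make_blobs, the number of clusters, and the random seeds are arbitrary choices for illustration only; they are not the data used in this chapter's exercises:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Synthetic data: 300 points grouped around 4 centers (illustrative only)
data_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Cluster the points and keep the cluster assigned to each one
pred_demo = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(data_demo)

# Silhouette: between -1 and 1; higher is better, values near 0 suggest overlap
print(silhouette_score(data_demo, pred_demo, metric='euclidean'))

# Calinski-Harabasz: unbounded; higher means better-defined, better-separated clusters
print(calinski_harabasz_score(data_demo, pred_demo))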
Unfortunately, the scikit-learn library does not contain any other methods designed specifically for measuring the performance of density-based clustering algorithms. Although the scores mentioned here may work in some cases for these algorithms, when they do not, the only remaining option is manual (visual) evaluation.
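When the data has only two or three features, manual evaluation usually amounts to plotting the points colored by their assigned cluster and judging the grouping by eye. The following minimal sketch, built on synthetic data from make_moons purely for illustration, shows this idea:
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaved half-moons: a shape that density-based algorithms handle well
points, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(points)

# Color each point by its assigned cluster and inspect the result visually
plt.scatter(points[:, 0], points[:, 1], c=labels)
plt.title('DBSCAN clusters (visual check)')
plt.show()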
However, it is worth mentioning that scikit-learn also provides performance measures for cases where the ground truth labels are known, sometimes referred to as supervised clustering; for instance, when performing clustering over a set of observations of journalism students who have already signed up for a major or a specialization area. If we used their demographic information, as well as some student records, to group them into clusters representing their choice of major, it would be possible to compare the predicted classification with the actual classification.
Some of these measures are as follows:
- Homogeneity score: This score is based on the premise that a clustering result is homogeneous if each cluster only contains data points that belong to a single class label. The output of the score is a number between 0 and 1, with 1 being a perfectly homogeneous labeling. The score is part of scikit-learn's metrics module, and it receives the list of ground truth labels and the list of predicted clusters as inputs, as follows:
from sklearn.metrics import homogeneity_score
score = homogeneity_score(true_labels, predicted_labels)
- Completeness score: The counterpart of the homogeneity score, a clustering result satisfies completeness if all data points that belong to a given class label are assigned to the same cluster. Again, the output is a number between 0 and 1, with 1 being the output for perfect completeness. This score is also part of scikit-learn's metrics module, and it also receives the ground truth labels and the predicted ones as inputs, as follows (a short toy example contrasting both scores is sketched right after this list):
from sklearn.metrics import completeness_score
score = completeness_score(true_labels, predicted_labels)
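To make the difference between the two scores concrete, consider the following hand-built example. The label lists are made up purely for illustration: splitting one class across two pure clusters keeps homogeneity at 1 but lowers completeness, while merging everything into a single cluster does the opposite:
from sklearn.metrics import homogeneity_score, completeness_score

true_labels = [0, 0, 1, 1]

# Every cluster is pure, but class 0 is split across two clusters
split_pred = [0, 1, 2, 2]
print(homogeneity_score(true_labels, split_pred))   # 1.0
print(completeness_score(true_labels, split_pred))  # approximately 0.67

# A single cluster contains each class entirely, but mixes the classes together
merged_pred = [0, 0, 0, 0]
print(homogeneity_score(true_labels, merged_pred))   # 0.0
print(completeness_score(true_labels, merged_pred))  # 1.0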
Note
To explore other measures that evaluate the performance of supervised clustering tasks, visit the following URL, under the clustering section: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.
Exercise 2.05: Evaluating the Silhouette Coefficient Score and Calinski–Harabasz Index
In this exercise, we will learn how to calculate the two scores we discussed in the previous section that are available in scikit-learn. Perform the following steps to complete this exercise:
- Import the Silhouette Coefficient score and the Calinski-Harabasz Index from the scikit-learn library:
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score
- Calculate the Silhouette Coefficient score for each of the algorithms we modeled in all of the previous exercises. Use the Euclidean distance as the metric for measuring the distance between points.
The input parameters of the silhouette_score() function are the data, the predicted values of the model (the clusters assigned to each data point), and the distance measure:
Note
The code snippet shown here uses a backslash ( \ ) to split the logic across multiple lines. When the code is executed, Python will ignore the backslash and treat the code on the next line as a direct continuation of the current line.
kmeans_score = silhouette_score(data, pred_kmeans, \
                                metric='euclidean')
meanshift_score = silhouette_score(data, pred_meanshift, \
                                   metric='euclidean')
dbscan_score = silhouette_score(data, pred_dbscan, \
                                metric='euclidean')
print(kmeans_score, meanshift_score, dbscan_score)
The first three lines call the silhouette_score() function for each of the models (the k-means, mean-shift, and DBSCAN algorithms) by inputting the data, the predictions, and the distance measure. The last line of code prints out the score for each of the models.
The scores come to approximately 0.359, 0.3705, and 0.1139 for the k-means, mean-shift, and DBSCAN algorithms, respectively.
You can observe that the k-means and mean-shift algorithms have similar scores, while the DBSCAN score is closer to zero. This can indicate that the performance of the first two algorithms is much better, and hence, the DBSCAN algorithm should not be considered for solving this data problem.
Nevertheless, it is important to remember that this type of score does not perform well when evaluating the DBSCAN algorithm. Because one cluster can surround another, the score interprets that as overlap when, in reality, the clusters are very well defined, as is the case with the current dataset (a short sketch illustrating this effect on synthetic data follows the conclusion of this exercise).
- Calculate the Calinski-Harabasz index for each of the algorithms we modeled in the previous exercises in this chapter. The input parameters of the calinski_harabasz_score() function are the data and the predicted values of the model (the clusters assigned to each data point):
kmeans_score = calinski_harabasz_score(data, pred_kmeans)
meanshift_score = calinski_harabasz_score(data, pred_meanshift)
dbscan_score = calinski_harabasz_score(data, pred_dbscan)
print(kmeans_score, meanshift_score, dbscan_score)
Again, the first three lines apply the calinski_harabasz_score() function to the three models by passing the data and the predictions as inputs. The last line prints out the results.
The values come to approximately 1379.7, 1305.14, and 0.0017 for the k-means, mean-shift, and DBSCAN algorithms, respectively. Once again, the results are consistent with those obtained using the Silhouette Coefficient score: the k-means and mean-shift algorithms performed similarly well, while the DBSCAN algorithm did not.
Moreover, it is worth mentioning that the scales of the two methods (the Silhouette Coefficient score and the Calinski-Harabasz index) differ significantly, so their values are not directly comparable.
Note
To access the source code for this specific section, please refer to https://packt.live/3e3YIif.
You can also run this example online at https://packt.live/2MXOQdZ. You must execute the entire Notebook in order to get the desired result.
You have successfully measured the performance of three different clustering algorithms.
In conclusion, the scores presented in this topic are a way of evaluating the performance of clustering algorithms. However, it is important to consider that their results are not definitive, since how faithfully they reflect performance varies from algorithm to algorithm, as we saw with DBSCAN.
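To see the limitation mentioned in the exercise in isolation, the following self-contained sketch builds two concentric rings with make_circles (synthetic data and illustrative parameter values, not the chapter's dataset). DBSCAN recovers the two rings, yet the Silhouette Coefficient stays low because one cluster surrounds the other:
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Two concentric rings: well-separated clusters, but one surrounds the other
rings, _ = make_circles(n_samples=500, factor=0.4, noise=0.03, random_state=0)

# DBSCAN groups each ring into its own cluster (eps chosen for this synthetic data)
ring_pred = DBSCAN(eps=0.2, min_samples=5).fit_predict(rings)
print(set(ring_pred))  # typically {0, 1}: two clusters, no noise points

# The score tends to be low despite the clear separation, because points on the
# outer ring are often closer to the inner ring than to the far side of their own ring
print(silhouette_score(rings, ring_pred, metric='euclidean'))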
Activity 2.05: Measuring and Comparing the Performance of the Algorithms
You might find yourself in a situation in which you are not sure about the performance of the algorithms as it cannot be evaluated graphically. Therefore, you will have to measure the performance of the algorithms using numerical metrics that can be used to make comparisons. For the previously trained models, calculate the Silhouette Coefficient score and the Calinski-Harabasz index to measure the performance of the algorithms. The following steps provide hints regarding how you can do this:
- Open the Jupyter Notebook that you used for the previous activity.
- Calculate both the Silhouette Coefficient score and the Calinski-Harabasz index for all of the models that you trained previously.
The results may differ based on the choices you made during the development of the previous activities and how you initialized certain parameters in each algorithm. Nevertheless, the following results can be expected for a k-means algorithm set to divide the dataset into six clusters, a mean-shift algorithm with a bandwidth equal to 0.4, and a DBSCAN algorithm with an epsilon score of 0.8:
Silhouette Coefficient:
K-means = 0.3515
mean-shift = 0.0933
DBSCAN = 0.1685
Calinski-Harabasz Index:
K-means = 145.73
mean-shift = 112.90
DBSCAN = 42.45
Note
The solution for this activity can be found via this link.