Assessing the performance of a clustering method
Without knowing the true labels, we cannot use the metrics introduced in the previous chapter. In this recipe, we will introduce three internal evaluation metrics that will help us assess the effectiveness of our clustering methods: Davies-Bouldin, Pseudo-F (sometimes referred to as Calinski-Harabasz), and the Silhouette Score. In contrast, if we knew the true labels, we could use a range of external measures, such as the Adjusted Rand Index, Homogeneity, or Completeness scores, to name a few.
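As a rough illustration of the difference between the two families of measures, the following sketch computes all of them with Scikit. It is not the recipe's own code: the synthetic data from make_blobs, the choice of KMeans, and all parameter values are assumptions made only to have something to score. The internal metrics need just the data and the predicted labels, while the external ones also need the true labels.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    completeness_score,
    davies_bouldin_score,
    homogeneity_score,
    silhouette_score,
)

# Synthetic data with known labels (an assumption, for illustration only)
X, y_true = make_blobs(
    n_samples=300, centers=3, cluster_std=0.5, random_state=42
)

# Any clustering method would do here; KMeans is just a convenient choice
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Internal metrics: computed from the data and the predicted labels only
internal = {
    "silhouette": silhouette_score(X, labels),
    "pseudo_f": calinski_harabasz_score(X, labels),
    "davies_bouldin": davies_bouldin_score(X, labels),
}

# External metrics: require the true labels, which we rarely have
external = {
    "adjusted_rand": adjusted_rand_score(y_true, labels),
    "homogeneity": homogeneity_score(y_true, labels),
    "completeness": completeness_score(y_true, labels),
}

for name, value in {**internal, **external}.items():
    print(f"{name}: {value:.3f}")
```

When reading the internal scores, keep in mind that higher values of the Silhouette Score and Pseudo-F indicate better-separated clusters, whereas for Davies-Bouldin lower is better.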
Note
Refer to the Scikit documentation on clustering methods for a deeper overview of the various external evaluation metrics:
http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
For a list of internal clustering validation methods, refer to http://datamining.rutgers.edu/publication/internalmeasures.pdf.
Getting ready
To execute this recipe, you will need pandas, NumPy, and Scikit. No other prerequisites...