Chapter 10 – Clustering News Articles
Evaluation
The evaluation of clustering algorithms is a difficult problem—on the one hand, we can sort of tell what good clusters look like; on the other hand, if we really know that, we should label some instances and use a supervised classifier! Much has been written on this topic. One slideshow on the topic that is a good introduction to the challenges follows:
http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf
In addition, a very comprehensive (although now a little dated) paper on this topic is here: http://web.itu.edu.tr/sgunduz/courses/verimaden/paper/validity_survey.pdf.
The scikit-learn package does implement a number of the metrics described in those links, with an overview here: http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation.
Using some of these, you can start evaluating which parameters need to be used for better clusterings. Using a Grid Search, we can find parameters that maximize...