Advanced Clustering and Unsupervised Models
In this chapter, we continue our analysis of clustering algorithms, focusing on more complex models that can solve problems where K-means fails. These algorithms are particularly helpful in contexts (for example, geographical segmentation) where the structure of the data is highly non-linear, so that approximating the clusters with simple convex regions leads to a substantial drop in performance.
In particular, the algorithms and the topics we are going to analyze are:
- Fuzzy C-means
- Spectral clustering based on the Shi-Malik algorithm
- DBSCAN, including the Calinski-Harabasz and Davies-Bouldin scores
The first model is Fuzzy C-means, which is an extension of K-means to a soft-labeling scenario. Just like Generative Gaussian Mixtures, the algorithm helps the data scientist to understand the pseudo-probability (a measure similar to an actual probability) of a data point belonging to each of the defined clusters.
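To make the idea of soft labeling more concrete, the following minimal sketch computes the membership degrees of a few points with respect to two fixed centroids. It assumes the commonly used Fuzzy C-means membership rule, where the degree of a point for a cluster is proportional to the inverse distance raised to 2/(m - 1); the function name `fuzzy_memberships`, the fuzziness coefficient `m`, the guard value `eps`, and the toy data are all illustrative choices, and the centroid update step of the full algorithm is intentionally omitted.

```python
import numpy as np


def fuzzy_memberships(X, centroids, m=2.0, eps=1e-12):
    """Return an (n_samples, n_clusters) matrix of membership degrees.

    Hypothetical helper: it performs only the membership step of
    Fuzzy C-means, with the centroids kept fixed.
    """
    # Pairwise Euclidean distances between samples and centroids
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Avoid a division by zero when a sample coincides with a centroid
    d = np.maximum(d, eps)
    # Inverse distances raised to 2 / (m - 1), then normalized per sample
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


# Toy dataset (assumed for illustration): three points on a line
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
centroids = np.array([[0.0, 0.0], [4.0, 0.0]])

U = fuzzy_memberships(X, centroids)
print(np.round(U, 3))
```

Each row of `U` sums to 1, which is why the values can be read as pseudo-probabilities. In this toy example, the middle point, which lies at distance 1 from the first centroid and 3 from the second, receives memberships of about 0.9 and 0.1, whereas K-means would simply assign it to the first cluster.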