Summary
In this chapter, we presented some fundamental clustering algorithms. We started with KNN, an instance-based method that, given a query point, retrieves the most similar samples from the dataset. We discussed three approaches: a naive one, which is also the most expensive in terms of computational complexity, and two strategies based respectively on the construction of a KD Tree and a Ball Tree. These two data structures can dramatically improve performance even when the number of samples is very large.
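As a quick refresher, the following is a minimal sketch (assuming scikit-learn is available and using a small synthetic dataset in place of a real one; the sample size, dimensionality, and number of neighbors are purely illustrative) that answers the same query with the naive brute-force search and with the two tree-based structures:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical synthetic dataset: 10,000 samples in 16 dimensions
X = np.random.uniform(-1.0, 1.0, size=(10000, 16))
query = np.random.uniform(-1.0, 1.0, size=(1, 16))

for algorithm in ('brute', 'kd_tree', 'ball_tree'):
    # 'brute' corresponds to the naive exhaustive search, while the
    # tree-based structures reorganize the dataset so that large
    # portions of the search space can be pruned
    knn = NearestNeighbors(n_neighbors=5, algorithm=algorithm)
    knn.fit(X)
    distances, indices = knn.kneighbors(query)
    print(algorithm, indices[0])

With low-dimensional data the tree-based searches are usually much faster than the naive one, although their advantage tends to shrink as the dimensionality grows.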
The next topic was a classic algorithm: K-means, a symmetric partitioning strategy, comparable to a Gaussian mixture with variances close to zero, that can solve many real-life problems. We discussed both the vanilla algorithm, whose random initialization can cause it to converge to poor sub-optimal solutions, and an optimized initialization method, called K-means++, which speeds up convergence towards solutions quite close to the global minimum. In the same section, we also...
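To make the effect of the initialization visible, here is a minimal sketch (again assuming scikit-learn; the blob dataset, number of clusters, and random seed are arbitrary illustrative choices) that fits K-means once with a purely random initialization and once with K-means++, comparing the final inertia:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical synthetic dataset with 5 well-separated blobs
X, _ = make_blobs(n_samples=2000, centers=5, cluster_std=1.0, random_state=1000)

for init in ('random', 'k-means++'):
    # n_init=1 exposes the outcome of a single (possibly poor) initialization
    km = KMeans(n_clusters=5, init=init, n_init=1, random_state=1000)
    km.fit(X)
    print(init, km.inertia_)

A single random initialization can end with a noticeably higher inertia than K-means++, which is why the optimized initialization (or several restarts via n_init) is generally preferred in practice.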