Summary
In this chapter, we presented some fundamental clustering algorithms. We started with KNN, an instance-based method that restructures the dataset in order to retrieve the samples most similar to a given query point. We discussed three approaches: a naive brute-force search, which is also the most expensive in terms of computational complexity, and two strategies based respectively on the construction of a k-d tree and a ball tree. These two data structures can dramatically improve performance even when the number of samples is very large.
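As a minimal sketch of how these three strategies can be compared in practice, the following snippet uses scikit-learn's NearestNeighbors class, whose algorithm parameter selects among brute-force, k-d tree, and ball tree searches. The dataset and the query point are arbitrary illustrative values, not taken from the chapter.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic dataset (illustrative values only)
X = np.random.uniform(-10.0, 10.0, size=(1000, 2))

# A single query point
query = np.array([[0.5, -0.3]])

# The three approaches discussed above: brute-force search,
# a k-d tree, and a ball tree
for algorithm in ('brute', 'kd_tree', 'ball_tree'):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm)
    nn.fit(X)
    distances, indices = nn.kneighbors(query)
    print(algorithm, indices[0])
```

All three runs return the same neighbors; the difference lies in how the query is answered, with the tree-based structures avoiding the full pairwise distance computation required by the brute-force approach.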
The next topic was a classic algorithm: K-means, a symmetric partitioning strategy, comparable to a Gaussian mixture with variances close to zero, that can solve many real-life problems. We discussed both the vanilla algorithm, whose random initialization can lead it to converge to sub-optimal solutions, and an optimized initialization method, called K-means++, which was able to speed up the convergence toward solutions quite close to the global minimum.
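To illustrate the effect of the two initialization schemes, here is a small sketch based on scikit-learn's KMeans, whose init parameter accepts both 'random' and 'k-means++'. The blob dataset, the number of clusters, and the random seeds are assumptions made only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic dataset with three well-separated blobs (illustrative values)
X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=1.0, random_state=1000)

# Vanilla K-means with purely random centroid initialization
km_random = KMeans(n_clusters=3, init='random', n_init=1, random_state=1000)
km_random.fit(X)

# K-means++ initialization (the default in scikit-learn)
km_pp = KMeans(n_clusters=3, init='k-means++', n_init=1, random_state=1000)
km_pp.fit(X)

# Compare the final inertia (sum of squared distances to the closest
# centroid) and the number of iterations needed to converge
print('Random init: inertia={:.2f}, iterations={}'.format(
    km_random.inertia_, km_random.n_iter_))
print('K-means++:   inertia={:.2f}, iterations={}'.format(
    km_pp.inertia_, km_pp.n_iter_))
```

On well-separated data both runs typically reach similar inertia, but K-means++ tends to need fewer iterations and is far less likely to get stuck in a poor local minimum when the clusters overlap.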
We also presented...