In this chapter, we explained the fundamental concepts of cluster analysis, starting from the concept of similarity and how to measure it. We discussed the K-means algorithm and its optimized variant, K-means++, and we analyzed the Breast Cancer Wisconsin dataset. We then discussed the most important evaluation metrics (with or without knowledge of the ground truth) and learned which factors can influence performance. The final two topics were k-nearest neighbors (KNN), a widely used algorithm for finding the samples most similar to a query vector, and vector quantization (VQ), a technique that exploits clustering algorithms to find a lossy representation of a sample (for example, an image) or a dataset.
In the next chapter, we will introduce some of the most important advanced clustering algorithms and show how they can handle non-convex problems.
...