Clustering divides a dataset into groups of similar items, called clusters. This is an unsupervised learning task, since we typically don't have any labels. In most realistic cases, the computational complexity is so high that we cannot find the best division into clusters; however, we can usually find a decent approximation. Cluster analysis requires a distance function, which indicates how close items are to each other. A common choice is Euclidean distance, which is the straight-line distance, as the crow flies. Another common choice is taxicab distance, which measures distance in city blocks. Clustering was first used in the 1930s by social science researchers, who worked without modern computers.
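To make the two distances mentioned above concrete, here is a minimal sketch using only the Python standard library. The function names `euclidean_distance` and `taxicab_distance` are illustrative choices, not part of any particular library, and the points are made-up values.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def taxicab_distance(p, q):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Two made-up points in the plane.
a = (0.0, 0.0)
b = (3.0, 4.0)

print(euclidean_distance(a, b))  # 5.0
print(taxicab_distance(a, b))    # 7.0
```

The taxicab distance is never smaller than the Euclidean distance for the same pair of points, because a route along city blocks cannot be shorter than the straight line.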
Clustering can be hard or soft. In hard clustering, an item belongs to exactly one cluster, while in soft clustering, an item can belong to multiple clusters with varying probabilities. In this book, I have used only the hard clustering method.
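The difference between hard and soft clustering can be sketched for a single item and a set of fixed cluster centers. This is only an illustration under stated assumptions: the helper names `hard_assignment` and `soft_assignment` are hypothetical, the centers are made up, and turning distances into probabilities with an exponential weighting is one simple choice among many, not the method of any specific algorithm.

```python
import math


def hard_assignment(point, centers):
    """Return the index of the single nearest center."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))


def soft_assignment(point, centers):
    """Return one membership probability per center; the values sum to 1.

    Assumption: weights decay exponentially with Euclidean distance.
    """
    weights = [math.exp(-math.dist(point, c)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up cluster centers and a point that lies closer to the first one.
centers = [(0.0, 0.0), (10.0, 0.0)]
point = (4.0, 0.0)

print(hard_assignment(point, centers))  # index of the nearest center
print(soft_assignment(point, centers))  # membership probability per center
```

The hard assignment returns a single cluster index, while the soft assignment gives the point a larger probability for the nearer center and a smaller, nonzero probability for the farther one. Note that `math.dist` requires Python 3.8 or later.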
We can...