Using k-means clustering
The k-means clustering method is widespread in data science, in part because it is simple to use and understand.
In this method, we have one primary hyperparameter, k, which determines the number of clusters. The algorithm works like so:
1. Initialize cluster centers: randomly select k points in the feature space as the initial cluster centroids.
2. Calculate the distance from each point to each cluster center with a metric such as Euclidean distance, and then assign each point to the nearest cluster.
3. Recompute each cluster center as the mean of the points assigned to that cluster.
4. Repeat steps 2 and 3 until the cluster centers change by only a small amount (or not at all); a code sketch of these steps follows this list.
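To make the steps concrete, here is a minimal sketch of the algorithm in Python using NumPy. The function name kmeans and its parameters (max_iters, tol) are illustrative choices, not part of any particular library:

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, rng=None):
    """A minimal k-means sketch. X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(rng)
    # Step 1: randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop once the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids

# Example usage with two synthetic clusters of 2D points
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0, scale=1, size=(50, 2)),
    rng.normal(loc=5, scale=1, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2, rng=0)
```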
The initialization of the cluster centers can be done in a more intelligent way to speed up convergence: rather than choosing purely at random, we can pick initial centers that are far away from each other (this is the idea behind k-means++ initialization). The other steps are simple enough to code by hand, but we can use k-means clustering easily...
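One widely used implementation, assumed here only as an illustration, is scikit-learn's KMeans class, whose default k-means++ initialization spreads the starting centers apart as described above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points drawn around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# init='k-means++' (scikit-learn's default) chooses initial centers that are
# spread far apart, which tends to speed up convergence
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels = km.fit_predict(X)       # cluster label for each point
centroids = km.cluster_centers_  # learned cluster centers
```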