Initializing Clusters
Since the beginning of this chapter, we've been referring to k-means every time we've fitted our clustering algorithms. But you may have noticed in each model summary that there was a hyperparameter called init with a default value of k-means++. We were, in fact, using k-means++ all this time.
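If you want to confirm this for yourself, the short sketch below inspects the hyperparameter directly. It assumes the models in this chapter are scikit-learn KMeans estimators; the variable name model and the choice of three clusters are only illustrative.

```python
from sklearn.cluster import KMeans

# Instantiate the estimator without passing init, so the default applies.
# n_clusters=3 is an arbitrary choice for this illustration.
model = KMeans(n_clusters=3)

# get_params() returns the estimator's hyperparameters as a dictionary.
init_value = model.get_params()["init"]
print(init_value)  # k-means++
```

Passing init='random' when instantiating the estimator would switch to the purely random initialization described in this section.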
The difference between k-means and k-means++ is in how they initialize clusters at the start of training. k-means randomly chooses the center of each cluster (called the centroid) and then assigns each data point to its nearest centroid. If the initial centroids are poorly placed, the algorithm may settle on a non-optimal grouping by the end of the training process. For example, in the following graph, we can clearly see the three natural groupings of the data, but the algorithm didn't succeed in identifying them properly:
k-means++ is an attempt to find better clusters...
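To make the effect of initialization concrete, here is a minimal sketch, assuming scikit-learn's KMeans and a synthetic dataset generated with make_blobs. The blob centers, the starting centroids, and the variable names are all illustrative assumptions rather than values from this chapter. The hand-picked unlucky_start array stands in for an unfortunate random draw: two starting centroids fall inside the same natural group and the third lands between the other two groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic groups (assumed centers, for illustration only).
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
X, _ = make_blobs(n_samples=300, centers=true_centers,
                  cluster_std=1.0, random_state=0)

# An unlucky start: two centroids inside the first group,
# one halfway between the second and third groups.
unlucky_start = np.array([[-0.5, 0.0], [0.5, 0.0], [7.5, 5.0]])

# A favorable start: one centroid near each natural group.
good_start = true_centers.copy()

# n_init=1 because an explicit array of starting centroids is supplied.
unlucky = KMeans(n_clusters=3, init=unlucky_start, n_init=1).fit(X)
good = KMeans(n_clusters=3, init=good_start, n_init=1).fit(X)

# A single k-means++ run for comparison (one seed, so only indicative).
plus_plus = KMeans(n_clusters=3, init="k-means++", n_init=1,
                   random_state=0).fit(X)

# inertia_ is the sum of squared distances to the nearest centroid;
# lower values mean tighter clusters.
print("unlucky start inertia:", unlucky.inertia_)
print("good start inertia:   ", good.inertia_)
print("k-means++ inertia:    ", plus_plus.inertia_)
```

With the unlucky start, the algorithm splits the first group in two and merges the other two groups, so its inertia ends up noticeably higher than with the favorable start. This is the kind of non-optimal grouping that a poor random initialization can produce.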