In this section, we will look at the k-means clustering algorithm: how it works, and how to apply it.
When clustering with k-means, we start with a dataset we want to cluster, as seen here:
We then choose the initial cluster centers. This is an important step, because badly chosen centers can lead to bad clusters, as shown in the following diagram:
The default options for the KMeans class in scikit-learn, however, help you avoid the problems associated with badly chosen starting cluster centers. I won't go into the details of how the class does this. In this section, all I'm going to do is choose a random subset of the dataset to serve as the initial cluster centers. This is not necessarily the best approach, and you probably shouldn't deviate from what the class is doing by default...
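As a rough illustration of the random-subset approach, here is a minimal sketch. The toy dataset from make_blobs, the choice of three clusters, the seeds, and the variable names are all assumptions made for this example and do not come from the original walkthrough. It picks distinct rows of the data at random and passes them to KMeans through its init parameter, then fits a second model with the default initialization for comparison.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed toy dataset: 300 two-dimensional points drawn around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

n_clusters = 3  # assumed number of clusters for this sketch
rng = np.random.default_rng(0)

# Pick n_clusters distinct rows of X at random to act as the initial centers.
idx = rng.choice(X.shape[0], size=n_clusters, replace=False)
initial_centers = X[idx]

# Passing an array as `init` makes KMeans start from exactly those centers.
# n_init=1 because there is only one explicit starting configuration.
km = KMeans(n_clusters=n_clusters, init=initial_centers, n_init=1)
labels = km.fit_predict(X)

# For comparison: the class's default initialization (init="k-means++").
km_default = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

print("inertia with random-subset init:", km.inertia_)
print("inertia with default init:      ", km_default.inertia_)
```

Comparing the two inertia values (the within-cluster sum of squared distances) is one simple way to see whether a particular random subset happened to be a poor starting point on your own data.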