Choosing the Number of Clusters
In the previous sections, we saw how easy it is to fit the k-means algorithm on a given dataset. In our ATO dataset, we found 8 different clusters that were mainly defined by the values of the Average net tax variable.
But you may have asked yourself: "Why 8 clusters? Why not 3 or 15 clusters?" These are indeed excellent questions. The short answer is that we used the default value of 8 for k-means' n_clusters hyperparameter, which defines the number of clusters to be found.
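To make this concrete, here is a minimal sketch that checks the default value and then sets the hyperparameter explicitly. It assumes scikit-learn's KMeans class and uses synthetic data generated with make_blobs as a stand-in for the ATO dataset; the variable names and the values of n_init and random_state are illustrative choices, not taken from the earlier sections.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption; this is not the ATO dataset)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Without an explicit value, n_clusters falls back to its default
default_model = KMeans()
print(default_model.n_clusters)  # 8

# Setting the hyperparameter explicitly before training
for k in (3, 8, 15):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # One centroid is learned per requested cluster
    print(k, model.cluster_centers_.shape, round(model.inertia_, 1))
```

Whatever value is passed, k-means returns exactly that many centroids, so the algorithm itself gives no indication of whether the chosen number suits the data.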
As you will recall from Chapter 2, Regression, and Chapter 4, Multiclass Classification with RandomForest, the value of a hyperparameter isn't learned by the algorithm but has to be set by you prior to training. For k-means, n_clusters is one of the most important hyperparameters you will have to tune. Choosing a low value will lead k-means to group many data points together, even though they are very different from each other. On the other hand,...