Finding an optimal number of clusters for k-means
Often, you will not know how many clusters to expect in your data. For two- or three-dimensional data, you could plot the dataset and try to eyeball the clusters. This becomes harder as the number of dimensions grows: beyond three dimensions, the data can no longer be plotted on a single chart.
In this recipe, we will show you how to find the optimal number of clusters for a k-means clustering model. We will use the Davies-Bouldin metric to assess the performance of our k-means models as we vary the number of clusters. Lower values of the metric indicate more compact, better-separated clusters, so the aim is to stop when a minimum of the metric is found.
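The idea of scanning candidate cluster counts and picking the one with the lowest Davies-Bouldin score can be sketched as follows. This is a minimal illustration and not the recipe's own code: it assumes a scikit-learn version that provides davies_bouldin_score, the helper name scan_cluster_counts is hypothetical, and the data is a synthetic stand-in generated with make_blobs. It also scores every candidate rather than stopping at the first minimum.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score


def scan_cluster_counts(data, k_values, random_state=0):
    """Fit k-means for each candidate k and return {k: Davies-Bouldin score}.

    Hypothetical helper for illustration only.
    """
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(data)
        # Lower Davies-Bouldin values mean tighter, better-separated clusters
        scores[k] = davies_bouldin_score(data, labels)
    return scores


# Synthetic stand-in for a real dataset: four well-separated blobs
# in five dimensions (too many dimensions to plot on one chart)
centers = np.array([
    [0, 0, 0, 0, 0],
    [10, 0, 0, 0, 0],
    [0, 10, 0, 0, 0],
    [0, 0, 10, 0, 0],
])
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5,
                  random_state=42)

scores = scan_cluster_counts(X, range(2, 9))
best_k = min(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 3))
print("lowest Davies-Bouldin score at k =", best_k)
```

On real data the curve is rarely this clean, so it is worth inspecting the scores for all candidate values rather than relying on the minimum alone.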
Getting ready
To execute this recipe, you will need pandas, NumPy, and scikit-learn. No other prerequisites are required.
How to do it…
To find the optimal number of clusters, we developed the findOptimalClusterNumber(...) method. The overall algorithm of estimating the k-means model has not changed; instead of calling findClusters_kmeans...