k-means binning
Another option is to use k-means clustering to determine the bins. The k-means algorithm randomly selects k data points as initial cluster centers and assigns every other data point to the closest center. The mean of each cluster is then computed and becomes the new center, and the data points are reassigned to the nearest new center. This process is repeated until the assignments stop changing, which gives a locally optimal set of centers rather than a guaranteed global optimum.
When k-means is used for binning, all data points in the same cluster will have the same ordinal value.
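To make the assign-and-update loop concrete, here is a minimal sketch of k-means on a single feature, written with NumPy only. The function name kmeans_1d, its parameters, and the toy values are our own illustration and are not part of scikit-learn or of the recipe that follows. Because the starting centers are chosen at random, a different seed can lead to a different set of bins.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100, seed=0):
    """Toy 1-D k-means; returns ordinal bin labels and sorted centers."""
    x = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = rng.choice(np.unique(x), size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Recompute each center as the mean of its points
        # (keep the old center if a cluster ends up empty).
        new_centers = np.array([
            x[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    # Sort the centers so that label 0 is the lowest bin and k-1 the highest.
    centers = np.sort(centers)
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    return labels, centers

# Hypothetical values with three visible groups.
values = [1, 2, 3, 50, 52, 54, 200, 210]
labels, centers = kmeans_1d(values, k=3)
print(labels)   # one ordinal bin label per value
print(centers)  # the cluster means, in ascending order
```

Every value that ends up nearest the same center receives the same label, which is the ordinal value referred to above.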
Getting ready
We will use scikit-learn this time for our binning. Scikit-learn has a great tool for creating bins based on k-means, KBinsDiscretizer.
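As a quick warm-up, the sketch below runs KBinsDiscretizer with strategy='kmeans' on synthetic data. The data, the column names total_cases and total_cases_bin, and the choice of four bins are assumptions made for illustration only; they are not the COVID-19 data used in this recipe.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic, right-skewed stand-in for a target such as case counts.
rng = np.random.default_rng(0)
y_demo = pd.DataFrame(
    {"total_cases": rng.lognormal(mean=8, sigma=1.5, size=200)}
)

# Bin edges are derived from 1-D k-means cluster centers.
kbins_demo = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="kmeans")
y_demo_bins = pd.DataFrame(
    kbins_demo.fit_transform(y_demo),
    columns=["total_cases_bin"],
    index=y_demo.index,
)

# How many observations landed in each ordinal bin, and where the edges fall.
print(y_demo_bins["total_cases_bin"].value_counts().sort_index())
print(kbins_demo.bin_edges_[0])
```

Unlike equal-width or equal-frequency bins, the bin counts here are typically uneven, because the edges follow the clusters in the data.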
How to do it...
- We start by instantiating a KBinsDiscretizer object. We will use it to create bins with the COVID-19 cases data:

  kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='kmeans', subsample=None)
  y_train_bins = \
    pd.DataFrame(kbins.fit_transform(y_train), columns=['...