Feature binning
Sometimes we will want to convert a continuous feature into a categorical feature. The process of grouping the values of a distribution into k intervals, for example k equally spaced intervals from the minimum to the maximum value, is called binning or, to use the somewhat less friendly term, discretization. Binning can address several important issues with a feature: skew, excessive kurtosis, and the presence of outliers.
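To make the equally spaced case concrete, here is a minimal sketch in plain Python. The function name equal_width_bins and the sample values are hypothetical and chosen only for illustration; they are not taken from the COVID data or from any library.

```python
from collections import Counter


def equal_width_bins(values, k):
    """Assign each value to one of k equally spaced intervals, numbered 0 to k-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / k
    # The maximum sits exactly on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


# Hypothetical, heavily right-skewed sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 100]
bins = equal_width_bins(sample, k=4)
counts = Counter(bins)

print(bins)
print(sorted(counts.items()))
```

With this sample, the seven small values share the first bin and the extreme value sits alone in the last one, leaving the two middle bins empty. The outlier becomes just another category, which limits its influence, but intervals of equal width can also end up very unevenly populated when a feature is skewed.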
Equal-width and equal-frequency binning
Binning might be a good choice with the COVID case data, so let's try it. It might also be useful with other variables in the dataset, including total deaths and population, but we will only work with total cases for now. total_cases is the target variable in the following code, so it is a column (the only column) on the y_train DataFrame:
- First, we need to import EqualFrequencyDiscretiser and EqualWidthDiscretiser from feature_engine. Additionally, we need to create training and testing DataFrames from the COVID data...