Grouping rare or infrequent categories
Rare categories are those present in only a small fraction of the observations. There is no fixed threshold for how small that fraction must be, but categories that appear in fewer than 5% of the observations are typically considered rare.
Infrequent labels often appear only in the train set or only in the test set, which makes algorithms prone to overfitting or leaves them unable to score an observation. In addition, when encoding categories as numbers, we only create mappings for the categories observed in the train set, so we won’t know how to encode new labels. To avoid these complications, we can group infrequent categories into a single category called Rare or Other.
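As a minimal sketch of this idea, the following snippet learns the frequent categories from the train set and replaces everything else with Rare in both sets. The function name group_rare_labels, its tol and rare_label parameters, and the toy data are assumptions made for illustration; they are not taken from the recipe's dataset.

```python
import pandas as pd


def group_rare_labels(train, test, tol=0.05, rare_label="Rare"):
    """Group categories seen in less than `tol` of the train rows.

    Hypothetical helper for illustration only. The frequent categories
    are learned from the train set and then applied to both sets.
    """
    # Fraction of train observations per category
    freq = train.value_counts(normalize=True)
    frequent = set(freq[freq >= tol].index)

    def encode(series):
        # Keep frequent categories; replace all other values, including
        # labels never seen in train (and missing values), with rare_label
        return series.where(series.isin(frequent), rare_label)

    return encode(train), encode(test)


# Toy data: "c" appears in 3% of the train rows, "d" only in the test set
train = pd.Series(["a"] * 60 + ["b"] * 37 + ["c"] * 3)
test = pd.Series(["a", "b", "c", "d"])

train_grouped, test_grouped = group_rare_labels(train, test)
print(train_grouped.value_counts())
print(test_grouped.tolist())
```

Because the frequent categories are learned from the train set only, a label that first appears in the test set is also mapped to Rare, so every observation can still be encoded.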
In this recipe, we will group infrequent categories using pandas and feature-engine.
How to do it...
First, let’s import the necessary Python libraries and get the dataset ready:
- Import the necessary Python libraries, functions, and classes:
import numpy as np
import pandas...