Rare values are categories that appear in only a small percentage of the observations. There is no fixed rule for how small that percentage must be, but categories present in fewer than 5% of the observations are typically considered rare. Infrequent labels often appear only in the train set or only in the test set, which makes algorithms prone to overfitting or leaves them unable to score observations that contain labels not seen during training. To avoid these complications, we can group infrequent categories into a new category called Rare or Other.
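The grouping described above can be sketched with plain pandas. This is a minimal illustration, not the recipe's implementation: the toy data, the `colour` variable, and the helper names `find_frequent_labels` and `group_rare_labels` are all hypothetical, and the 5% threshold is only the typical value mentioned above. The frequent labels are learned from the train set and then applied to both sets, so a label that appears only in the test set is also mapped to Rare.

```python
import pandas as pd

# Hypothetical toy data: "green" and "teal" each cover less than 5% of the train set,
# and "purple" appears only in the test set.
train = pd.Series(["red"] * 60 + ["blue"] * 36 + ["green"] * 3 + ["teal"] * 1, name="colour")
test = pd.Series(["red", "green", "purple", "blue"], name="colour")


def find_frequent_labels(series, threshold=0.05):
    """Return the categories whose relative frequency is at least `threshold`."""
    freq = series.value_counts(normalize=True)
    return freq[freq >= threshold].index.tolist()


def group_rare_labels(series, frequent_labels, rare_label="Rare"):
    """Keep frequent categories and replace every other value with `rare_label`."""
    return series.where(series.isin(frequent_labels), rare_label)


# Learn the frequent labels from the train set only.
frequent = find_frequent_labels(train, threshold=0.05)

# Apply the same mapping to both sets.
train_grouped = group_rare_labels(train, frequent)
test_grouped = group_rare_labels(test, frequent)

print(train_grouped.value_counts())
print(test_grouped.tolist())
```

Learning the list of frequent labels from the train set alone keeps information from the test set out of the transformation, and it guarantees that both sets end up with the same set of categories.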
For details on how to identify rare labels, visit the Pinpointing rare categories in categorical variables recipe in Chapter 1, Foreseeing Variable Problems in Building ML Models.
In this recipe, we will group infrequent categories using pandas and Feature-engine.