Grouping rare or infrequent categories
Rare categories are those present in only a small fraction of the observations. There is no rule of thumb to determine how small that fraction must be, but typically, any value below 5% can be considered rare.
Infrequent labels often appear only in the train set or only in the test set, making algorithms prone to overfitting or unable to score an observation. In addition, when encoding categories to numbers, we only create mappings for the categories observed in the train set, so we won't know how to encode new labels. To avoid these complications, we can group infrequent categories into a single category called Rare or Other.
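The grouping idea can be sketched in a few lines of pandas on a hypothetical toy column (the data and the "Other" label below are made up for illustration):

```python
import pandas as pd

# Toy column: "c" and "d" each appear in fewer than 5% of the 100 rows.
s = pd.Series(["a"] * 60 + ["b"] * 35 + ["c"] * 3 + ["d"] * 2)

# Fraction of observations per category.
freqs = s.value_counts(normalize=True)

# Keep labels above the 5% threshold; collapse the rest into "Other".
frequent = freqs[freqs > 0.05].index
grouped = s.where(s.isin(frequent), other="Other")

print(grouped.value_counts())
```

`Series.where` keeps the original value where the condition is True and substitutes `"Other"` elsewhere, so the two rare labels end up sharing one category.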
In this recipe, we will group infrequent categories using pandas and Feature-engine.
How to do it...
First, let’s import the necessary Python libraries and get the dataset ready:
- Import the necessary Python libraries, functions, and classes:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.encoding import RareLabelEncoder
- Let’s load the dataset and divide it into train and test sets:
data = pd.read_csv("credit_approval_uci.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
- Let’s capture the fraction of observations per category in A7 in a variable:
freqs = X_train["A7"].value_counts(normalize=True)
We can see the percentage of observations per category of A7, expressed as decimals, in the following output after executing print(freqs):
v          0.573499
h          0.209110
ff         0.084886
bb         0.080745
z          0.014493
dd         0.010352
j          0.010352
Missing    0.008282
n          0.006211
o          0.002070
Name: A7, dtype: float64
If we consider those labels present in less than 5% of the observations as rare, then z, dd, j, Missing, n, and o are rare categories.
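Instead of reading the rare labels off the output by eye, we can also capture them programmatically. The sketch below rebuilds the frequency Series from the output shown above and filters it against the 5% threshold:

```python
import pandas as pd

# Frequencies as printed above for A7.
freqs = pd.Series({
    "v": 0.573499, "h": 0.209110, "ff": 0.084886, "bb": 0.080745,
    "z": 0.014493, "dd": 0.010352, "j": 0.010352,
    "Missing": 0.008282, "n": 0.006211, "o": 0.002070,
})

# Labels at or below the 5% threshold are the rare ones.
rare_cat = freqs[freqs <= 0.05].index.tolist()
print(rare_cat)
```

This yields the same six labels identified by inspection.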
- Let’s create a list containing the names of the categories present in more than 5% of the observations:
frequent_cat = [x for x in freqs.loc[freqs > 0.05].index]
If we execute print(frequent_cat), we will see the frequent categories of A7: ['v', 'h', 'ff', 'bb'].
- Let’s replace rare labels, that is, those present in 5% or fewer of the observations, with the "Rare" string:
X_train["A7"] = np.where(
    X_train["A7"].isin(frequent_cat), X_train["A7"], "Rare"
)
X_test["A7"] = np.where(
    X_test["A7"].isin(frequent_cat), X_test["A7"], "Rare"
)
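The same replacement can be written with the pandas Series.where method, which avoids the round trip through NumPy; a minimal sketch on a made-up column:

```python
import pandas as pd

frequent_cat = ["v", "h", "ff", "bb"]

# Hypothetical slice of the A7 column for illustration.
a7 = pd.Series(["v", "z", "h", "o", "bb"])

# Keep values found in frequent_cat; replace everything else with "Rare".
a7_grouped = a7.where(a7.isin(frequent_cat), other="Rare")

print(a7_grouped.tolist())
```

Note that the condition logic is the same as in the np.where call: rows passing isin keep their value, the rest become "Rare".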
- Let’s determine the percentage of observations in the encoded variable:
X_train["A7"].value_counts(normalize=True)
We can see that the infrequent labels have now been regrouped into the Rare category:
v       0.573499
h       0.209110
ff      0.084886
bb      0.080745
Rare    0.051760
Name: A7, dtype: float64
Now, let’s group rare labels using Feature-engine. First, we must divide the dataset into train and test sets, as we did in step 2.
- Let’s create a rare label encoder that groups categories present in less than 5% of the observations, provided that the categorical variable has more than four distinct values:
rare_encoder = RareLabelEncoder(tol=0.05, n_categories=4)
- Let’s fit the encoder so that it finds the categorical variables and then learns their most frequent categories:
rare_encoder.fit(X_train)
Tip
Upon fitting, the transformer will raise warnings indicating that many categorical variables have fewer than four categories, so their values will not be grouped. The transformer just lets you know that this is happening.
We can display the frequent categories per variable by executing rare_encoder.encoder_dict_, and the variables that will be encoded by executing rare_encoder.variables_.
- Finally, let’s group rare labels in the train and test sets:
X_train_enc = rare_encoder.transform(X_train)
X_test_enc = rare_encoder.transform(X_test)
Now that we have grouped rare labels, we are ready to encode the categorical variables, as we’ve done in other recipes in this chapter.
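As a quick illustration of that next step, pd.get_dummies can one-hot encode the cleaned column; this stands in for the encoding recipes in this chapter, and the values below are made up:

```python
import pandas as pd

# Hypothetical A7 column after rare-label grouping.
a7 = pd.Series(["v", "Rare", "h", "Rare", "bb"], name="A7")

# One binary column per remaining category, including "Rare".
dummies = pd.get_dummies(a7, prefix="A7")

print(list(dummies.columns))
```

Because the rare labels were collapsed first, the encoded output has one extra column for "Rare" instead of one column per infrequent label, which keeps the feature space small.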
How it works...
In this recipe, we grouped infrequent categories using pandas and Feature-engine.
We determined the fraction of observations per category of the A7 variable using pandas value_counts() with the normalize parameter set to True. Using a list comprehension, we captured the names of the categories present in more than 5% of the observations. Finally, using NumPy's where(), we searched each row of A7; if the observation was one of the frequent categories in the list, which we checked with the pandas isin() method, its value was kept; otherwise, it was replaced with "Rare".
We automated the preceding steps for multiple categorical variables using Feature-engine's RareLabelEncoder(). By setting tol to 0.05, we retained categories present in more than 5% of the observations. By setting n_categories to 4, we only grouped rare categories in variables with more than four unique values. With the fit() method, the transformer identified the categorical variables and then learned and stored their frequent categories. With the transform() method, the transformer replaced infrequent categories with the "Rare" string.
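The fit/transform logic described above can be sketched in plain pandas. This is a simplified reimplementation for illustration, not Feature-engine's actual internals; the data, the fit_frequent helper, and its parameters are hypothetical:

```python
import pandas as pd

def fit_frequent(train_col, tol=0.05, n_categories=4):
    # If the column has n_categories or fewer distinct values, keep all
    # labels (mirroring the warning behavior described in the Tip above).
    if train_col.nunique() <= n_categories:
        return set(train_col.unique())
    # Otherwise keep only labels whose train-set frequency exceeds tol.
    freqs = train_col.value_counts(normalize=True)
    return set(freqs[freqs > tol].index)

# Toy train/test split: "z" and "o" are rare, "new_label" is unseen.
train = pd.Series(["v"] * 60 + ["h"] * 25 + ["ff"] * 6 + ["bb"] * 6
                  + ["z"] * 2 + ["o"])
test = pd.Series(["v", "z", "new_label", "h"])

# "Fit" on the train set, then "transform" the test set.
frequent = fit_frequent(train)
test_grouped = test.where(test.isin(frequent), other="Rare")

print(test_grouped.tolist())
```

Because the frequent categories are learned on the train set only, a label never seen during fitting (like "new_label" here) also falls into Rare, which is exactly how grouping protects against the unseen-category problem mentioned at the start of the recipe.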