Performing one-hot encoding of frequent categories
One-hot encoding represents each variable’s category with a binary variable. Hence, one-hot encoding of highly cardinal variables or datasets with multiple categorical features can expand the feature space dramatically. This, in turn, may increase the computational cost of using machine learning models or deteriorate their performance. To reduce the number of binary variables, we can perform one-hot encoding of the most frequent categories. One-hot encoding the top categories is equivalent to treating the remaining, less frequent categories as a single, unique category.
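As a quick illustration of the dimensionality saving, here is a minimal sketch with a made-up toy series (not the recipe’s dataset), contrasting full one-hot encoding with encoding only the top two categories:

```python
import pandas as pd

# Toy series with 5 distinct categories; "a" and "b" dominate.
s = pd.Series(["a", "a", "a", "b", "b", "c", "d", "e"])

# Full one-hot encoding: one binary column per category.
full = pd.get_dummies(s)
print(full.shape[1])  # 5 columns

# Top-2 encoding: one binary column per frequent category. Rows with
# "c", "d", or "e" are all zeros, so the rare categories are
# implicitly grouped into a single "other" bucket.
top_2 = s.value_counts().head(2).index
reduced = pd.DataFrame(
    {f"x_{cat}": (s == cat).astype(int) for cat in top_2}
)
print(reduced.shape[1])  # 2 columns
```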
In this recipe, we will implement one-hot encoding of the most popular categories using pandas and Feature-engine.
How to do it...
First, let’s import the necessary Python libraries and get the dataset ready:
1. Import the required Python libraries, functions, and classes:
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from feature_engine.encoding import OneHotEncoder
```
2. Let’s load the dataset and divide it into train and test sets:
```python
data = pd.read_csv("credit_approval_uci.csv")

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
```
Tip
The most frequent categories need to be determined in the train set. This is to avoid data leakage.
3. Let’s inspect the unique categories of the A6 variable:

```python
X_train["A6"].unique()
```
The unique values of A6 are displayed in the following output:

```
array(['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x', 'd', 'k', 'j',
       'Missing', 'aa', 'r'], dtype=object)
```
4. Let’s count the number of observations per category of A6, sort them in decreasing order, and then display the five most frequent categories:

```python
X_train["A6"].value_counts().sort_values(
    ascending=False).head(5)
```
We can see the five most frequent categories and the number of observations per category in the following output:
```
c     93
q     56
w     48
i     41
ff    38
Name: A6, dtype: int64
```
5. Now, let’s capture the most frequent categories of A6 in a list by using the code in step 4 inside a list comprehension:

```python
top_5 = [
    x for x in X_train["A6"].value_counts().sort_values(
        ascending=False).head(5).index
]
```
6. Now, let’s add a binary variable per top category to the train and test sets:
```python
for label in top_5:
    X_train[f"A6_{label}"] = np.where(
        X_train["A6"] == label, 1, 0)
    X_test[f"A6_{label}"] = np.where(
        X_test["A6"] == label, 1, 0)
```
7. Let’s display the top 10 rows of the original and encoded variable, A6, in the train set:

```python
X_train[["A6"] + [f"A6_{label}" for label in top_5]].head(10)
```
In the output of step 7, we can see the A6 variable, followed by the binary variables:

```
     A6  A6_c  A6_q  A6_w  A6_i  A6_ff
596   c     1     0     0     0      0
303   q     0     1     0     0      0
204   w     0     0     1     0      0
351  ff     0     0     0     0      1
118   m     0     0     0     0      0
247   q     0     1     0     0      0
652   i     0     0     0     1      0
513   e     0     0     0     0      0
230  cc     0     0     0     0      0
250   e     0     0     0     0      0
```
We can automate one-hot encoding of frequent categories with Feature-engine. First, let’s load and divide the dataset, as we did in step 2.
8. Let’s set up the one-hot encoder to encode the five most frequent categories of the A6 and A7 variables:

```python
ohe_enc = OneHotEncoder(
    top_categories=5,
    variables=["A6", "A7"],
)
```
Tip
Feature-engine’s OneHotEncoder() will encode all categorical variables in the dataset by default unless we specify the variables to encode, as we did in step 8.
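As a quick illustration of that default (a sketch, assuming the train set from step 2 is in memory):

```python
# Without the `variables` argument, OneHotEncoder() finds and encodes
# every variable of object or categorical dtype in the DataFrame.
ohe_all = OneHotEncoder(top_categories=5)
ohe_all.fit(X_train)

# After fitting, the variables_ attribute holds the variables
# that the encoder selected for encoding.
print(ohe_all.variables_)
```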
9. Let’s fit the encoder to the train set so that it learns and stores the most frequent categories of A6 and A7:

```python
ohe_enc.fit(X_train)
```
Note
The number of frequent categories to encode is arbitrarily determined by the user.
10. Finally, let’s encode A6 and A7 in the train and test sets:

```python
X_train_enc = ohe_enc.transform(X_train)
X_test_enc = ohe_enc.transform(X_test)
```
You can view the new binary variables in the DataFrame by executing X_train_enc.head(). You can also find the top five categories learned by the encoder by executing ohe_enc.encoder_dict_, as shown in the sketch below.
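For example (a minimal sketch; the exact binary columns you see depend on the random train/test split):

```python
# The first rows of the encoded train set; A6 and A7 now appear
# as binary columns such as A6_c, A6_q, and so on.
print(X_train_enc.head())

# encoder_dict_ maps each encoded variable to the list of top
# categories learned from the train set during fit().
print(ohe_enc.encoder_dict_)
```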
Note
Feature-engine replaces the original variable with the binary ones returned by one-hot encoding, leaving the dataset ready to use in machine learning.
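A quick way to confirm this behavior (a sketch reusing the objects from the previous steps):

```python
# The raw categorical columns are dropped from the transformed data...
assert "A6" not in X_train_enc.columns
assert "A7" not in X_train_enc.columns

# ...and each encoded variable contributes one binary column per
# learned top category.
print([col for col in X_train_enc.columns if col.startswith("A6_")])
```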
How it works...
In this recipe, we performed one-hot encoding of the five most popular categories using pandas, NumPy, and Feature-engine.
In the first part of this recipe, we worked with the A6 categorical variable. We inspected its unique categories with pandas unique(). We then counted the number of observations per category using pandas value_counts(), which returned a pandas Series with the categories as the index and the number of observations as the values. We sorted the categories from the one with the most to the one with the fewest observations using pandas sort_values(), and reduced the Series to the five most popular categories with pandas head(). We then used this Series in a list comprehension to capture the names of the most frequent categories. Finally, we looped over each top category and, with NumPy’s where() function, created binary variables that take the value 1 if the observation shows the category, or 0 otherwise.
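The pandas and NumPy steps can be folded into a small helper function. The following is a sketch; the function name and signature are our own, not part of the recipe:

```python
def encode_top_categories(train, test, variable, n=5):
    """One-hot encode the n most frequent categories of `variable`,
    learning the top categories from the train set only (to avoid
    data leakage) and adding the binary columns to both sets."""
    top = train[variable].value_counts().head(n).index
    for label in top:
        train[f"{variable}_{label}"] = np.where(
            train[variable] == label, 1, 0)
        test[f"{variable}_{label}"] = np.where(
            test[variable] == label, 1, 0)
    return train, test

X_train, X_test = encode_top_categories(X_train, X_test, "A6", n=5)
```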
To perform one-hot encoding of the five most popular categories of the A6 and A7 variables with Feature-engine, we used OneHotEncoder(), setting the top_categories argument to 5 and passing the variable names in a list to the variables argument. With fit(), the encoder learned the top categories from the train set and stored them in its encoder_dict_ attribute. Then, with transform(), OneHotEncoder() replaced the original variables with the set of binary ones.
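For orientation, encoder_dict_ is a plain Python dictionary mapping each encoded variable to its learned top categories, so you can inspect or log it directly. The A6 categories in the comment below follow from the counts in step 4; the A7 entry is illustrative:

```python
# Expected structure (A7's categories depend on the train set):
# {'A6': ['c', 'q', 'w', 'i', 'ff'], 'A7': [...]}
for variable, categories in ohe_enc.encoder_dict_.items():
    print(variable, "->", categories)
```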
There’s more...
This recipe is based on the winning solution of the KDD 2009 cup, Winning the KDD Cup Orange Challenge with Ensemble Selection (http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf), where the authors limited one-hot encoding to the 10 most frequent categories of each variable.