Performing one-hot encoding of frequent categories
One-hot encoding represents each category of a variable with a binary variable. Hence, one-hot encoding high-cardinality variables, or datasets with many categorical features, can expand the feature space dramatically. This, in turn, may increase the computational cost of training machine learning models or degrade their performance. To reduce the number of binary variables, we can one-hot encode only the most frequent categories. Doing so is equivalent to treating the remaining, less frequent categories as a single category: observations in that group show 0 in every binary variable.
In this recipe, we will implement one-hot encoding of the most frequent categories using pandas, scikit-learn, and feature-engine.
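Before moving to the recipe steps, the idea can be illustrated with a minimal pandas sketch. The toy DataFrame, the `colour` column, and the helper `one_hot_top_categories` are hypothetical names made up for this illustration; they are not part of the recipe's dataset or of any library.

```python
import pandas as pd

# Toy data: the "colour" column and its values are made up for illustration.
df = pd.DataFrame(
    {"colour": ["red", "blue", "red", "green", "blue", "red", "pink"]}
)


def one_hot_top_categories(data, column, top_n):
    """Return a copy of `data` with one binary column for each of the
    `top_n` most frequent categories of `column`, plus the list of
    categories that were encoded (hypothetical helper)."""
    # value_counts() sorts categories by frequency, most frequent first.
    top = data[column].value_counts().head(top_n).index.tolist()
    encoded = data.copy()
    for category in top:
        # 1 where the observation shows the category, 0 otherwise.
        encoded[f"{column}_{category}"] = (data[column] == category).astype(int)
    return encoded, top


encoded, top = one_hot_top_categories(df, "colour", top_n=2)
print(top)
print(encoded)
```

With `top_n=2`, only `red` and `blue` get binary variables. Rows with `green` or `pink` show 0 in both new columns, which is how the infrequent categories end up grouped together.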
How to do it...
First, let’s import the necessary Python libraries and get the dataset ready:
- Import the required Python libraries, functions, and classes:
import pandas as pd
import numpy as np
from sklearn...