Encoding categorical features with medium or high cardinality
When we are working with a categorical feature that has many unique values, say 15 or more, it can be impractical to create a dummy variable for each value. With high cardinality, that is, a very large number of unique values, there may also be too few observations for some values to provide much information for our models. At the extreme, with an ID variable, there is just one observation for each value.
There are a couple of ways to handle medium or high cardinality. One is to create dummies for the top k categories, those with the most observations, and group the remaining values into a single "other" category. Another is to use feature hashing, also known as the hashing trick, which maps each value to one of a fixed number of columns. We will explore both strategies in this recipe.
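Before turning to the library encoders, it may help to see the two ideas in miniature. The following is a minimal, library-free sketch in plain Python, not the feature_engine or category_encoders implementation. The function names top_k_dummies and hash_encode, the toy region values, and the choice of MD5 with a small number of columns are all assumptions made for illustration.

```python
from collections import Counter
import hashlib


def top_k_dummies(values, k):
    """One-hot encode the k most frequent categories; lump the rest into 'other'.

    Hypothetical helper for illustration only. It assumes no real
    category is itself named 'other'.
    """
    top = [cat for cat, _ in Counter(values).most_common(k)]
    columns = top + ["other"]
    rows = []
    for value in values:
        label = value if value in top else "other"
        rows.append({col: int(col == label) for col in columns})
    return rows


def hash_encode(values, n_components=8):
    """Hashing trick: map each category to one of n_components columns.

    Hypothetical helper for illustration only. Distinct categories can
    collide in the same column, which is the price of a fixed width.
    """
    rows = []
    for value in values:
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_components
        row = [0] * n_components
        row[index] = 1
        rows.append(row)
    return rows


# Made-up toy values, not taken from the COVID-19 data.
regions = [
    "Western Europe", "Western Europe",
    "East Asia", "East Asia", "East Asia",
    "Caribbean", "Oceania",
]

dummies = top_k_dummies(regions, k=2)
hashed = hash_encode(regions, n_components=4)

print(dummies[0])  # a row for one of the top 2 categories
print(dummies[5])  # a row for a rare category, flagged as 'other'
print(hashed[0])   # a row with a single 1 among 4 columns
```

With the top k approach the number of columns is k + 1 and each column keeps a readable meaning. With hashing, the number of columns is fixed in advance regardless of how many unique values appear, but we can no longer tell from a column which original value produced it.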
Getting ready
We continue to use the OneHotEncoder from feature_engine in this recipe. We will also use the HashingEncoder from category_encoders. We will be working with COVID-19 data in this recipe, which has total cases and...