Grouping rare or infrequent categories
Rare categories are those present in only a small fraction of the observations. There is no rule of thumb to determine how small that fraction must be, but typically, any value below 5% can be considered rare.
Infrequent labels often appear only in the train set or only in the test set, making algorithms prone to overfitting or unable to score an observation. In addition, when encoding categories to numbers, we only create mappings for the categories observed in the train set, so we won't know how to encode new labels. To avoid these complications, we can group infrequent categories into a single category called Rare or Other.
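The grouping idea can be sketched in a few lines of pandas on a hypothetical toy column (the data and the "Other" label below are made up for illustration):

```python
import pandas as pd

# Toy column: "c" and "d" each appear in fewer than 5% of the 100 rows.
s = pd.Series(["a"] * 60 + ["b"] * 35 + ["c"] * 3 + ["d"] * 2)

# Fraction of observations per category.
freqs = s.value_counts(normalize=True)

# Keep labels above the 5% threshold; collapse the rest into "Other".
frequent = freqs[freqs > 0.05].index
grouped = s.where(s.isin(frequent), other="Other")

print(grouped.value_counts())
```

`Series.where` keeps the original value where the condition is True and substitutes `"Other"` elsewhere, so the two rare labels end up sharing one category.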
In this recipe, we will group infrequent categories using pandas and Feature-engine.
How to do it...
First, let’s import the necessary Python libraries and get the dataset ready:
- Import the necessary Python libraries, functions, and classes:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.encoding import RareLabelEncoder
- Let’s load the dataset and divide it into train and test sets:
data = pd.read_csv("credit_approval_uci.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
- Let’s capture the fraction of observations per category in A7 in a variable:
freqs = X_train["A7"].value_counts(normalize=True)
We can see the percentage of observations per category of A7, expressed as decimals, in the following output after executing print(freqs):
v          0.573499
h          0.209110
ff         0.084886
bb         0.080745
z          0.014493
dd         0.010352
j          0.010352
Missing    0.008282
n          0.006211
o          0.002070
Name: A7, dtype: float64
If we consider those labels present in less than 5% of the observations as rare, then z, dd, j, Missing, n, and o are rare categories.
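Instead of reading the rare labels off the output by eye, we can also capture them programmatically. The sketch below rebuilds the frequency Series from the output shown above and filters it against the 5% threshold:

```python
import pandas as pd

# Frequencies as printed above for A7.
freqs = pd.Series({
    "v": 0.573499, "h": 0.209110, "ff": 0.084886, "bb": 0.080745,
    "z": 0.014493, "dd": 0.010352, "j": 0.010352,
    "Missing": 0.008282, "n": 0.006211, "o": 0.002070,
})

# Labels at or below the 5% threshold are the rare ones.
rare_cat = freqs[freqs <= 0.05].index.tolist()
print(rare_cat)
```

This yields the same six labels identified by inspection.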
- Let’s create a list containing the names of the categories present in more than 5% of the observations:
frequent_cat = [x for x in freqs.loc[freqs > 0.05].index]
If we execute print(frequent_cat), we will see the frequent categories of A7: ['v', 'h', 'ff', 'bb'].
- Let’s replace rare labels, that is, those present in 5% or fewer of the observations, with the "Rare" string:
X_train["A7"] = np.where(
    X_train["A7"].isin(frequent_cat), X_train["A7"], "Rare"
)
X_test["A7"] = np.where(
    X_test["A7"].isin(frequent_cat), X_test["A7"], "Rare"
)
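The same replacement can be written with the pandas Series.where method, which avoids the round trip through NumPy; a minimal sketch on a made-up column:

```python
import pandas as pd

frequent_cat = ["v", "h", "ff", "bb"]

# Hypothetical slice of the A7 column for illustration.
a7 = pd.Series(["v", "z", "h", "o", "bb"])

# Keep values found in frequent_cat; replace everything else with "Rare".
a7_grouped = a7.where(a7.isin(frequent_cat), other="Rare")

print(a7_grouped.tolist())
```

Note that the condition logic is the same as in the np.where call: rows passing isin keep their value, the rest become "Rare".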
- Let’s determine the percentage of observations in the encoded variable:
X_train["A7"].value_counts(normalize=True)
We can see that the infrequent labels have now been regrouped into the Rare category:
v       0.573499
h       0.209110
ff      0.084886
bb      0.080745
Rare    0.051760
Name: A7, dtype: float64
Now, let’s group rare labels using Feature-engine. First, we must divide the dataset into train and test sets, as we did in step 2.
- Let’s create a rare label encoder that groups categories present in less than 5% of the observations, provided that the categorical variable has more than four distinct values:
rare_encoder = RareLabelEncoder(tol=0.05, n_categories=4)
- Let’s fit the encoder so that it finds the categorical variables and then learns their most frequent categories:
rare_encoder.fit(X_train)
Tip
Upon fitting, the transformer will raise warnings indicating that many categorical variables have fewer than four categories, so their values will not be grouped. The transformer just lets you know that this is happening.
We can display the frequent categories per variable by executing rare_encoder.encoder_dict_, and the variables that will be encoded by executing rare_encoder.variables_.
- Finally, let’s group rare labels in the train and test sets:
X_train_enc = rare_encoder.transform(X_train)
X_test_enc = rare_encoder.transform(X_test)
Now that we have grouped rare labels, we are ready to encode the categorical variables, as we’ve done in other recipes in this chapter.
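As a quick illustration of that next step, pd.get_dummies can one-hot encode the cleaned column; this stands in for the encoding recipes in this chapter, and the values below are made up:

```python
import pandas as pd

# Hypothetical A7 column after rare-label grouping.
a7 = pd.Series(["v", "Rare", "h", "Rare", "bb"], name="A7")

# One binary column per remaining category, including "Rare".
dummies = pd.get_dummies(a7, prefix="A7")

print(list(dummies.columns))
```

Because the rare labels were collapsed first, the encoded output has one extra column for "Rare" instead of one column per infrequent label, which keeps the feature space small.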
How it works...
In this recipe, we grouped infrequent categories using pandas and Feature-engine.
We determined the fraction of observations per category of the A7 variable using pandas value_counts() with the normalize parameter set to True. Using a list comprehension, we captured the names of the categories present in more than 5% of the observations. Finally, using NumPy's where(), we searched each row of A7; if the observation was one of the frequent categories in the list, which we checked with the pandas isin() method, its value was kept; otherwise, it was replaced with "Rare".
We automated the preceding steps for multiple categorical variables using Feature-engine's RareLabelEncoder(). By setting tol to 0.05, we retained categories present in more than 5% of the observations. By setting n_categories to 4, we only grouped rare categories in variables with more than four unique values. With the fit() method, the transformer identified the categorical variables and then learned and stored their frequent categories. With the transform() method, the transformer replaced infrequent categories with the "Rare" string.
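The fit/transform logic described above can be sketched in plain pandas. This is a simplified reimplementation for illustration, not Feature-engine's actual internals; the data, the fit_frequent helper, and its parameters are hypothetical:

```python
import pandas as pd

def fit_frequent(train_col, tol=0.05, n_categories=4):
    # If the column has n_categories or fewer distinct values, keep all
    # labels (mirroring the warning behavior described in the Tip above).
    if train_col.nunique() <= n_categories:
        return set(train_col.unique())
    # Otherwise keep only labels whose train-set frequency exceeds tol.
    freqs = train_col.value_counts(normalize=True)
    return set(freqs[freqs > tol].index)

# Toy train/test split: "z" and "o" are rare, "new_label" is unseen.
train = pd.Series(["v"] * 60 + ["h"] * 25 + ["ff"] * 6 + ["bb"] * 6
                  + ["z"] * 2 + ["o"])
test = pd.Series(["v", "z", "new_label", "h"])

# "Fit" on the train set, then "transform" the test set.
frequent = fit_frequent(train)
test_grouped = test.where(test.isin(frequent), other="Rare")

print(test_grouped.tolist())
```

Because the frequent categories are learned on the train set only, a label never seen during fitting (like "new_label" here) also falls into Rare, which is exactly how grouping protects against the unseen-category problem mentioned at the start of the recipe.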