Performing one-hot encoding of frequent categories
One-hot encoding represents each variable’s category with a binary variable. Hence, one-hot encoding of highly cardinal variables or datasets with multiple categorical features can expand the feature space dramatically. This, in turn, may increase the computational cost of using machine learning models or deteriorate their performance. To reduce the number of binary variables, we can perform one-hot encoding of the most frequent categories. One-hot encoding the top categories is equivalent to treating the remaining, less frequent categories as a single, unique category.
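As a quick illustration of the dimensionality saving, here is a minimal sketch with a made-up toy series (not the recipe’s dataset), contrasting full one-hot encoding with encoding only the top two categories:

```python
import pandas as pd

# Toy series with 5 distinct categories; "a" and "b" dominate.
s = pd.Series(["a", "a", "a", "b", "b", "c", "d", "e"])

# Full one-hot encoding: one binary column per category.
full = pd.get_dummies(s)
print(full.shape[1])  # 5 columns

# Top-2 encoding: one binary column per frequent category. Rows with
# "c", "d", or "e" are all zeros, so the rare categories are
# implicitly grouped into a single "other" bucket.
top_2 = s.value_counts().head(2).index
reduced = pd.DataFrame(
    {f"x_{cat}": (s == cat).astype(int) for cat in top_2}
)
print(reduced.shape[1])  # 2 columns
```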
In this recipe, we will implement one-hot encoding of the most popular categories using pandas and Feature-engine.
How to do it...
First, let’s import the necessary Python libraries and get the dataset ready:
1. Import the required Python libraries, functions, and classes:
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from feature_engine.encoding import OneHotEncoder
```
2. Let’s load the dataset and divide it into train and test sets:
```python
data = pd.read_csv("credit_approval_uci.csv")

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=["target"], axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
```
Tip
The most frequent categories need to be determined in the train set. This is to avoid data leakage.
3. Let’s inspect the unique categories of the A6 variable:

```python
X_train["A6"].unique()
```
The unique values of A6 are displayed in the following output:

```
array(['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x', 'd', 'k', 'j',
       'Missing', 'aa', 'r'], dtype=object)
```
4. Let’s count the number of observations per category of A6, sort them in decreasing order, and then display the five most frequent categories:

```python
X_train["A6"].value_counts().sort_values(
    ascending=False).head(5)
```
We can see the five most frequent categories and the number of observations per category in the following output:
```
c     93
q     56
w     48
i     41
ff    38
Name: A6, dtype: int64
```
5. Now, let’s capture the most frequent categories of A6 in a list by using the code in step 4 inside a list comprehension:

```python
top_5 = [
    x for x in X_train["A6"].value_counts().sort_values(
        ascending=False).head(5).index
]
```
6. Now, let’s add a binary variable per top category to the train and test sets:
```python
for label in top_5:
    X_train[f"A6_{label}"] = np.where(
        X_train["A6"] == label, 1, 0)
    X_test[f"A6_{label}"] = np.where(
        X_test["A6"] == label, 1, 0)
```
7. Let’s display the top 10 rows of the original and encoded variable, A6, in the train set:

```python
X_train[["A6"] + [f"A6_{label}" for label in top_5]].head(10)
```
In the output of step 7, we can see the A6 variable, followed by the binary variables:

```
     A6  A6_c  A6_q  A6_w  A6_i  A6_ff
596   c     1     0     0     0      0
303   q     0     1     0     0      0
204   w     0     0     1     0      0
351  ff     0     0     0     0      1
118   m     0     0     0     0      0
247   q     0     1     0     0      0
652   i     0     0     0     1      0
513   e     0     0     0     0      0
230  cc     0     0     0     0      0
250   e     0     0     0     0      0
```
We can automate one-hot encoding of frequent categories with Feature-engine. First, let’s load and divide the dataset, as we did in step 2.
8. Let’s set up the one-hot encoder to encode the five most frequent categories of the A6 and A7 variables:

```python
ohe_enc = OneHotEncoder(
    top_categories=5,
    variables=["A6", "A7"],
)
```
Tip
Feature-engine’s OneHotEncoder() will encode all categorical variables in the dataset by default unless we specify the variables to encode, as we did in step 8.
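As a quick illustration of that default (a sketch, assuming the train set from step 2 is in memory):

```python
# Without the `variables` argument, OneHotEncoder() finds and encodes
# every variable of object or categorical dtype in the DataFrame.
ohe_all = OneHotEncoder(top_categories=5)
ohe_all.fit(X_train)

# After fitting, the variables_ attribute holds the variables
# that the encoder selected for encoding.
print(ohe_all.variables_)
```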
9. Let’s fit the encoder to the train set so that it learns and stores the most frequent categories of A6 and A7:

```python
ohe_enc.fit(X_train)
```
Note
The number of frequent categories to encode is arbitrarily determined by the user.
10. Finally, let’s encode A6 and A7 in the train and test sets:

```python
X_train_enc = ohe_enc.transform(X_train)
X_test_enc = ohe_enc.transform(X_test)
```
You can view the new binary variables in the DataFrame by executing X_train_enc.head(). You can also find the top five categories learned by the encoder by executing ohe_enc.encoder_dict_, as shown in the sketch below.
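For example (a minimal sketch; the exact binary columns you see depend on the random train/test split):

```python
# The first rows of the encoded train set; A6 and A7 now appear
# as binary columns such as A6_c, A6_q, and so on.
print(X_train_enc.head())

# encoder_dict_ maps each encoded variable to the list of top
# categories learned from the train set during fit().
print(ohe_enc.encoder_dict_)
```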
Note
Feature-engine replaces the original variable with the binary ones returned by one-hot encoding, leaving the dataset ready to use in machine learning.
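A quick way to confirm this behavior (a sketch reusing the objects from the previous steps):

```python
# The raw categorical columns are dropped from the transformed data...
assert "A6" not in X_train_enc.columns
assert "A7" not in X_train_enc.columns

# ...and each encoded variable contributes one binary column per
# learned top category.
print([col for col in X_train_enc.columns if col.startswith("A6_")])
```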
How it works...
In this recipe, we performed one-hot encoding of the five most popular categories using pandas, NumPy, and Feature-engine.
In the first part of this recipe, we worked with the A6 categorical variable. We inspected its unique categories with pandas unique(). We then counted the number of observations per category using pandas value_counts(), which returned a pandas Series with the categories as the index and the number of observations as the values. We sorted the categories from the one with the most to the one with the fewest observations using pandas sort_values(), and reduced the Series to the five most popular categories with pandas head(). We then used this Series in a list comprehension to capture the names of the most frequent categories. Finally, we looped over each top category and, with NumPy’s where() function, created binary variables that take the value 1 if the observation shows the category, or 0 otherwise.
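The pandas and NumPy steps can be folded into a small helper function. The following is a sketch; the function name and signature are our own, not part of the recipe:

```python
def encode_top_categories(train, test, variable, n=5):
    """One-hot encode the n most frequent categories of `variable`,
    learning the top categories from the train set only (to avoid
    data leakage) and adding the binary columns to both sets."""
    top = train[variable].value_counts().head(n).index
    for label in top:
        train[f"{variable}_{label}"] = np.where(
            train[variable] == label, 1, 0)
        test[f"{variable}_{label}"] = np.where(
            test[variable] == label, 1, 0)
    return train, test

X_train, X_test = encode_top_categories(X_train, X_test, "A6", n=5)
```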
To perform one-hot encoding of the five most popular categories of the A6 and A7 variables with Feature-engine, we used OneHotEncoder(), setting the top_categories argument to 5 and passing the variable names in a list to the variables argument. With fit(), the encoder learned the top categories from the train set and stored them in its encoder_dict_ attribute. Then, with transform(), OneHotEncoder() replaced the original variables with the set of binary ones.
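For orientation, encoder_dict_ is a plain Python dictionary mapping each encoded variable to its learned top categories, so you can inspect or log it directly. The A6 categories in the comment below follow from the counts in step 4; the A7 entry is illustrative:

```python
# Expected structure (A7's categories depend on the train set):
# {'A6': ['c', 'q', 'w', 'i', 'ff'], 'A7': [...]}
for variable, categories in ohe_enc.encoder_dict_.items():
    print(variable, "->", categories)
```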
There’s more...
This recipe is based on the winning solution of the KDD 2009 cup, Winning the KDD Cup Orange Challenge with Ensemble Selection (http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf), where the authors limited one-hot encoding to the 10 most frequent categories of each variable.