Python Feature Engineering Cookbook

You're reading from Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models.

Product type: Paperback
Published in: Oct 2022
Publisher: Packt
ISBN-13: 9781804611302
Length: 386 pages
Edition: 2nd Edition
Author: Soledad Galli
Table of Contents

Preface
Chapter 1: Imputing Missing Data
Chapter 2: Encoding Categorical Variables
Chapter 3: Transforming Numerical Variables
Chapter 4: Performing Variable Discretization
Chapter 5: Working with Outliers
Chapter 6: Extracting Features from Date and Time Variables
Chapter 7: Performing Feature Scaling
Chapter 8: Creating New Features
Chapter 9: Extracting Features from Relational Data with Featuretools
Chapter 10: Creating Features from a Time Series with tsfresh
Chapter 11: Extracting Features from Text Variables
Index
Other Books You May Enjoy

Performing one-hot encoding of frequent categories

One-hot encoding represents each category of a categorical variable with a binary variable. Hence, one-hot encoding variables with high cardinality, or datasets with multiple categorical features, can expand the feature space dramatically. This, in turn, may increase the computational cost of using machine learning models or deteriorate their performance. To reduce the number of binary variables, we can perform one-hot encoding of only the most frequent categories. One-hot encoding the top categories is equivalent to treating the remaining, less frequent categories as a single, unique category.

In this recipe, we will implement one-hot encoding of the most popular categories using pandas, NumPy, and Feature-engine.

How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

  1. Import the required Python libraries, functions, and classes:
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from feature_engine.encoding import OneHotEncoder
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(
            labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )

Tip

The most frequent categories need to be determined in the train set only, to avoid data leakage.

  3. Let’s inspect the unique categories of the A6 variable:
    X_train["A6"].unique()

The unique values of A6 are displayed in the following output:

array(['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x', 'd', 'k', 'j', 'Missing', 'aa', 'r'], dtype=object)
  4. Let’s count the number of observations per category of A6, sort them in decreasing order, and then display the five most frequent categories:
    X_train["A6"].value_counts().sort_values(
        ascending=False).head(5)

We can see the five most frequent categories and the number of observations per category in the following output:

c     93
q     56
w     48
i     41
ff    38
Name: A6, dtype: int64
  5. Now, let’s capture the most frequent categories of A6 in a list by using the code in step 4 inside a list comprehension:
    top_5 = [
        x for x in X_train["A6"].value_counts().sort_values(
            ascending=False).head(5).index
    ]
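Note that value_counts() sorts the counts in descending order by default, so the same top_5 list can also be captured with a simpler, equivalent expression (a minimal sketch):

    top_5 = X_train["A6"].value_counts().head(5).index.tolist()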
  6. Now, let’s add a binary variable per top category to the train and test sets:
    for label in top_5:
        X_train[f"A6_{label}"] = np.where(
            X_train["A6"] == label, 1, 0)
        X_test[f"A6_{label}"] = np.where(
            X_test["A6"] == label, 1, 0)
  7. Let’s display the top 10 rows of the original and encoded variable, A6, in the train set:
    X_train[["A6"] + [f"A6_{label}" for label in top_5]].head(10)

In the output of step 7, we can see the A6 variable, followed by the binary variables:

         A6  A6_c  A6_q  A6_w  A6_i  A6_ff
    596   c     1     0     0     0      0
    303   q     0     1     0     0      0
    204   w     0     0     1     0      0
    351  ff     0     0     0     0      1
    118   m     0     0     0     0      0
    247   q     0     1     0     0      0
    652   i     0     0     0     1      0
    513   e     0     0     0     0      0
    230  cc     0     0     0     0      0
    250   e     0     0     0     0      0

We can automate one-hot encoding of frequent categories with Feature-engine. First, let’s load and divide the dataset, as we did in step 2.

  8. Let’s set up the one-hot encoder to encode the five most frequent categories of the A6 and A7 variables:
    ohe_enc = OneHotEncoder(
        top_categories=5,
        variables=["A6", "A7"]
    )

Tip

Feature-engine’s OneHotEncoder() will encode all categorical variables in the dataset by default unless we specify the variables to encode, as we did in step 8.

  9. Let’s fit the encoder to the train set so that it learns and stores the most frequent categories of A6 and A7:
    ohe_enc.fit(X_train)

Note

The number of frequent categories to encode is arbitrarily determined by the user.

  10. Finally, let’s encode A6 and A7 in the train and test sets:
    X_train_enc = ohe_enc.transform(X_train)
    X_test_enc = ohe_enc.transform(X_test)

You can view the new binary variables in the DataFrame by executing X_train_enc.head(). You can also find the top five categories learned by the encoder by executing ohe_enc.encoder_dict_.
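For reference, encoder_dict_ maps each encoded variable to the list of top categories learned during fit(). A minimal sketch (A6’s categories below are the ones we found in step 4; A7’s depend on the data):

    ohe_enc.encoder_dict_
    # e.g. {'A6': ['c', 'q', 'w', 'i', 'ff'], 'A7': [...]}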

Note

Feature-engine replaces the original variable with the binary ones returned by one-hot encoding, leaving the dataset ready to use in machine learning.
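A quick way to verify this behavior, reusing the objects created above (a sketch):

    # The raw A6 and A7 columns are gone; only the binary indicators remain.
    assert "A6" not in X_train_enc.columns
    assert "A7" not in X_train_enc.columns
    print([col for col in X_train_enc.columns if col.startswith("A6_")])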

How it works...

In this recipe, we performed one-hot encoding of the five most popular categories using pandas, NumPy, and Feature-engine.

In the first part of this recipe, we worked with the A6 categorical variable. We inspected its unique categories with pandas unique(). Next, we counted the number of observations per category using pandas value_counts(), which returned a pandas Series with the categories as the index and the number of observations as values. We then sorted the categories from most to fewest observations using pandas sort_values(), and reduced the Series to the five most popular categories with pandas head(). We used this Series in a list comprehension to capture the names of the most frequent categories. After that, we looped over each top category and, with NumPy’s where() function, created a binary variable that takes the value 1 if the observation shows the category, or 0 otherwise.
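The equivalence mentioned at the start of this recipe can also be made explicit with pandas alone: collapsing the infrequent categories into a single label before one-hot encoding yields the same top-category indicators, plus one extra column for the grouped remainder. Here is a minimal sketch reusing the top_5 list from step 5 (the "Other" label is an arbitrary choice):

    # Replace categories outside the top 5 with a single "Other" label.
    A6_grouped = X_train["A6"].where(X_train["A6"].isin(top_5), other="Other")
    # One-hot encode the collapsed variable.
    dummies = pd.get_dummies(A6_grouped, prefix="A6")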

To perform one-hot encoding of the five most popular categories of the A6 and A7 variables with Feature-engine, we used OneHotEncoder(), indicating 5 in the top_categories argument and passing the variable names in a list to the variables argument. With fit(), the encoder learned the top categories from the train set and stored them in its encoder_dict_ attribute. Then, with transform(), OneHotEncoder() replaced the original variables with the set of binary ones.
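Because Feature-engine transformers follow the scikit-learn API, the encoder can also sit inside a scikit-learn Pipeline. The following is a sketch; the classifier is only illustrative, and fitting it requires the remaining variables to be numeric:

    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression

    pipe = Pipeline([
        ("encoder", OneHotEncoder(top_categories=5, variables=["A6", "A7"])),
        ("clf", LogisticRegression()),
    ])
    # pipe.fit(X_train, y_train) fits the encoder and the model in one step.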

There’s more...

This recipe is based on the winning solution of the KDD 2009 cup, Winning the KDD Cup Orange Challenge with Ensemble Selection (http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf), where the authors limited one-hot encoding to the 10 most frequent categories of each variable.
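Reproducing that approach with Feature-engine is a one-parameter change. As noted in the tip after step 8, leaving variables unset makes the encoder pick up every categorical variable in the dataset (a sketch):

    # Encode the 10 most frequent categories of each categorical variable.
    ohe_10 = OneHotEncoder(top_categories=10)
    X_train_enc = ohe_10.fit_transform(X_train)
    X_test_enc = ohe_10.transform(X_test)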

You have been reading a chapter from Python Feature Engineering Cookbook - Second Edition (Packt, Oct 2022, ISBN-13 9781804611302).