Python Feature Engineering Cookbook

You're reading from Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type: Paperback
Published in: Oct 2022
Publisher: Packt
ISBN-13: 9781804611302
Length: 386 pages
Edition: 2nd Edition
Author: Soledad Galli
Table of Contents (14 chapters)

Preface
Chapter 1: Imputing Missing Data
Chapter 2: Encoding Categorical Variables (Free Chapter)
Chapter 3: Transforming Numerical Variables
Chapter 4: Performing Variable Discretization
Chapter 5: Working with Outliers
Chapter 6: Extracting Features from Date and Time Variables
Chapter 7: Performing Feature Scaling
Chapter 8: Creating New Features
Chapter 9: Extracting Features from Relational Data with Featuretools
Chapter 10: Creating Features from a Time Series with tsfresh
Chapter 11: Extracting Features from Text Variables
Index
Other Books You May Enjoy

Grouping rare or infrequent categories

Rare categories are those present in only a small fraction of the observations. There is no strict rule for how small that fraction must be, but typically, any category present in less than 5% of the observations can be considered rare.

Infrequent labels may appear only in the train set or only in the test set, making models prone to overfitting or leaving them unable to score some observations. In addition, when encoding categories to numbers, we only create mappings for the categories observed in the train set, so we won’t know how to encode new labels. To avoid these complications, we can group infrequent categories into a single category, often called Rare or Other.
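The unseen-label problem can be seen on a tiny made-up example (the data below is illustrative, not from the book's dataset): a numeric mapping learned on the train set has no entry for a category that appears only at test time.

```python
import pandas as pd

# Train categories: v, h, ff. The test set contains "o", never seen in training.
train = pd.Series(["v", "h", "v", "ff"])
test = pd.Series(["v", "o"])

# Learn a category-to-integer mapping from the train set only
mapping = {cat: code for code, cat in enumerate(sorted(train.unique()))}

encoded_test = test.map(mapping)
print(encoded_test)  # "o" becomes NaN: no code was learned for the unseen label
```

Grouping infrequent labels into a shared "Rare" bucket before encoding avoids these NaNs, because any unexpected label can be routed into the bucket.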

In this recipe, we will group infrequent categories using pandas and Feature-engine.

How to do it...

First, let’s import the necessary Python libraries and get the dataset ready:

  1. Import the necessary Python libraries, functions, and classes:
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from feature_engine.encoding import RareLabelEncoder
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s capture the fraction of observations per category in A7 in a variable:
    freqs = X_train["A7"].value_counts(normalize=True)

We can see the percentage of observations per category of A7, expressed as decimals, in the following output after executing print(freqs):

v	0.573499
h	0.209110
ff	0.084886
bb	0.080745
z	0.014493
dd	0.010352
j	0.010352
Missing	0.008282
n	0.006211
o	0.002070
Name: A7, dtype: float64

If we consider those labels present in less than 5% of the observations as rare, then z, dd, j, Missing, n, and o are rare categories.

  4. Let’s create a list containing the names of the categories present in more than 5% of the observations:
    frequent_cat = [
        x for x in freqs.loc[freqs > 0.05].index.values]

If we execute print(frequent_cat), we will see the frequent categories of A7:

['v', 'h', 'ff', 'bb']
  5. Let’s replace rare labels – that is, those present in 5% or fewer of the observations – with the "Rare" string:
    X_train["A7"] = np.where(
        X_train["A7"].isin(frequent_cat),
        X_train["A7"], "Rare"
    )
    X_test["A7"] = np.where(
        X_test["A7"].isin(frequent_cat),
        X_test["A7"], "Rare"
    )
  6. Let’s determine the percentage of observations per category in the encoded variable:
    X_train["A7"].value_counts(normalize=True)

We can see that the infrequent labels have now been re-grouped into the Rare category:

v       0.573499
h       0.209110
ff      0.084886
bb      0.080745
Rare    0.051760
Name: A7, dtype: float64
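As a quick sanity check, the 5.18% share of the new Rare group is exactly the sum of the fractions of the six rare categories from the earlier output:

```python
# Fractions of z, dd, j, Missing, n, and o from the value_counts() output above
rare_fractions = [0.014493, 0.010352, 0.010352, 0.008282, 0.006211, 0.002070]

# Their total matches the Rare group's share after grouping
rare_share = round(sum(rare_fractions), 6)
print(rare_share)  # 0.05176
```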

Now, let’s group rare labels using Feature-engine. First, we must divide the dataset into train and test sets, as we did in step 2.

  7. Let’s create a rare label encoder that groups categories present in less than 5% of the observations, provided that the categorical variable has at least four distinct values:
    rare_encoder = RareLabelEncoder(tol=0.05, n_categories=4)
  8. Let’s fit the encoder so that it finds the categorical variables and then learns their most frequent categories:
    rare_encoder.fit(X_train)

Tip

Upon fitting, the transformer will raise warnings indicating that some categorical variables have fewer than four categories, so their values will not be grouped. The warnings are informational only.

We can display the frequent categories per variable by executing rare_encoder.encoder_dict_, as well as the variables that will be encoded by executing rare_encoder.variables_.

  9. Finally, let’s group rare labels in the train and test sets:
    X_train_enc = rare_encoder.transform(X_train)
    X_test_enc = rare_encoder.transform(X_test)

Now that we have grouped rare labels, we are ready to encode the categorical variables, as we’ve done in other recipes in this chapter.

How it works...

In this recipe, we grouped infrequent categories using pandas and Feature-engine.

We determined the fraction of observations per category of the A7 variable using pandas value_counts() with the normalize parameter set to True. Using a list comprehension, we captured the names of the categories present in more than 5% of the observations. Finally, using NumPy’s where(), we searched each row of A7; if the observation was one of the frequent categories in the list, which we checked with the pandas isin() method, its value was kept; otherwise, it was replaced with "Rare".
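These pandas steps can be sketched end to end on a small synthetic series (the data here is made up for illustration; it is not the credit approval dataset):

```python
import numpy as np
import pandas as pd

# Synthetic categorical variable: two dominant labels, one mid-sized label,
# and four labels that each appear only once (1% of 100 observations)
s = pd.Series(["v"] * 60 + ["h"] * 30 + ["ff"] * 6 + ["z", "dd", "j", "n"])

# Fraction of observations per category
freqs = s.value_counts(normalize=True)

# Categories present in more than 5% of the observations
frequent_cat = [x for x in freqs.loc[freqs > 0.05].index.values]

# Keep frequent categories; replace everything else with "Rare"
grouped = pd.Series(np.where(s.isin(frequent_cat), s, "Rare"))
print(grouped.value_counts(normalize=True))
```

The four one-off labels collapse into a single Rare group holding 4% of the observations, while v, h, and ff keep their original shares.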

We automated the preceding steps for multiple categorical variables using Feature-engine’s RareLabelEncoder(). By setting tol to 0.05, we retained categories present in more than 5% of the observations. By setting n_categories to 4, we only grouped rare categories in variables with at least four unique values. With the fit() method, the transformer identified the categorical variables and then learned and stored their frequent categories. With the transform() method, the transformer replaced the infrequent categories with the "Rare" string.
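Under stated assumptions (tol as the minimum frequency a label needs to be kept, and n_categories as the minimum number of distinct values a variable needs before any grouping happens), the transformer's behaviour can be sketched in plain pandas. This is an illustrative reimplementation on made-up data, not Feature-engine's actual code:

```python
import pandas as pd

def fit_frequent_labels(df, tol=0.05, n_categories=4):
    """Learn the frequent categories per non-numeric column."""
    encoder_dict = {}
    for col in df.select_dtypes(exclude="number"):
        freqs = df[col].value_counts(normalize=True)
        if df[col].nunique() >= n_categories:
            encoder_dict[col] = list(freqs[freqs >= tol].index)
        else:
            # Too few distinct values: keep all categories, group nothing
            encoder_dict[col] = list(freqs.index)
    return encoder_dict

def transform_rare(df, encoder_dict, rare_label="Rare"):
    """Replace categories not in the learned frequent list with rare_label."""
    df = df.copy()
    for col, frequent in encoder_dict.items():
        df[col] = df[col].where(df[col].isin(frequent), rare_label)
    return df

# Toy frame: A7 has four distinct values (one rare), A6 has only two
df = pd.DataFrame({
    "A7": ["v"] * 12 + ["h"] * 5 + ["ff"] * 4 + ["z"],
    "A6": ["a"] * 17 + ["b"] * 5,
})
enc = fit_frequent_labels(df)
out = transform_rare(df, enc)
```

Here A7's single "z" observation (under 5% of rows) becomes "Rare", while A6, having only two distinct values, is left untouched, mirroring the warnings the real transformer raises for low-cardinality variables.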
