Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type Paperback
Published in Oct 2022
Publisher Packt
ISBN-13 9781804611302
Length 386 pages
Edition 2nd Edition
Author Soledad Galli

Replacing categories with counts or the frequency of observations

In count or frequency encoding, we replace the categories with the count or the fraction of observations showing that category. That is, if 10 out of 100 observations show the category blue for the Color variable, we would replace blue with 10 when doing count encoding, or with 0.1 if performing frequency encoding. These encoding methods, which capture the representation of each label in a dataset, are very popular in data science competitions. The assumption is that the number of observations per category is somewhat predictive of the target.
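The Color example above can be reproduced with a small pandas Series. This is only a sketch: the data and variable names below are made up for illustration and are not part of the recipe's dataset.

```python
import pandas as pd

# Hypothetical toy data: 100 observations, 10 of which are "blue".
color = pd.Series(["blue"] * 10 + ["red"] * 60 + ["green"] * 30, name="Color")

# Count encoding: each category maps to its number of observations.
count_map = color.value_counts().to_dict()

# Frequency encoding: each category maps to its fraction of observations.
freq_map = color.value_counts(normalize=True).to_dict()

# Replace the categories with the counts or the frequencies.
color_count = color.map(count_map)
color_freq = color.map(freq_map)

print(count_map["blue"], freq_map["blue"])
```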

Tip

Note that if two different categories are present in the same number of observations, they will be replaced by the same value, which leads to information loss.
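To see this information loss in practice, consider the following sketch. It uses made-up data in which two categories happen to appear the same number of times.

```python
import pandas as pd

# Hypothetical toy data: "red" and "green" both appear exactly twice.
color = pd.Series(["blue", "blue", "blue", "red", "red", "green", "green"])

count_map = color.value_counts().to_dict()
encoded = color.map(count_map)

# Compare the number of distinct values before and after encoding.
n_categories = color.nunique()
n_encoded_values = encoded.nunique()

print(n_categories, n_encoded_values)
```

After encoding, red and green share the same value, so a model can no longer tell them apart.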

In this recipe, we will perform count and frequency encoding using pandas, Feature-engine, and Category Encoders.

How to do it...

Let’s begin by making some imports and preparing the data:

  1. Import pandas and the required function:
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s count the number of observations per category of the A7 variable and capture it in a dictionary:
    counts = X_train["A7"].value_counts().to_dict()

Tip

To encode categories with their frequency, execute X_train["A7"].value_counts(normalize=True).to_dict().

If we execute print(counts), we can observe the count of observations per category:

{'v': 277, 'h': 101, 'ff': 41, 'bb': 39, 'z': 7, 'dd': 5, 'j': 5, 'Missing': 4, 'n': 3, 'o': 1}
  4. Let’s replace the categories in A7 with the counts:
    X_train["A7"] = X_train["A7"].map(counts)
    X_test["A7"] = X_test["A7"].map(counts)

Go ahead and inspect the data by executing X_train.head() to corroborate that the categories have been replaced by the counts.
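One thing to keep in mind with the pandas approach is that map() returns NaN for any category that is missing from the dictionary, which happens when the test set contains a category that was not seen in the train set. The sketch below uses made-up data to show this. Filling with 0 is just one possible convention and an assumption of this example, not something the recipe prescribes.

```python
import pandas as pd

# Hypothetical toy data; "z" appears only in the test set.
train = pd.DataFrame({"A7": ["v", "v", "h", "v", "h", "ff"]})
test = pd.DataFrame({"A7": ["v", "z", "ff"]})

# Learn the counts from the train set only.
counts = train["A7"].value_counts().to_dict()

# map() returns NaN for any key missing from the dictionary.
test_encoded = test["A7"].map(counts)
n_unseen = int(test_encoded.isna().sum())

# Assumed convention: an unseen category has zero training observations.
test_filled = test_encoded.fillna(0).astype(int)

print(n_unseen, test_filled.tolist())
```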

Now, let’s carry out count encoding using Feature-engine. First, let’s load and divide the dataset, as we did in step 2.

  5. Let’s import the count encoder from Feature-engine:
    from feature_engine.encoding import CountFrequencyEncoder
  6. Let’s set up the encoder so that it encodes all categorical variables with the count of observations:
    count_enc = CountFrequencyEncoder(
        encoding_method="count", variables=None,
    )

Tip

CountFrequencyEncoder() will automatically find and encode all categorical variables in the train set. To encode only a subset of the variables, we can pass the variable names in a list to the variables argument.

  7. Let’s fit the encoder to the train set so that it stores the number of observations per category per variable:
    count_enc.fit(X_train)

Tip

The dictionaries with the category-to-counts pairs are stored in the encoder_dict_ attribute and can be displayed by executing count_enc.encoder_dict_.

  8. Finally, let’s replace the categories with counts in the train and test sets:
    X_train_enc = count_enc.transform(X_train)
    X_test_enc = count_enc.transform(X_test)

Tip

If the test set contains categories that were not present in the train set, the transformer will replace them with np.nan and raise a warning to make you aware of this. To prevent this behavior, consider grouping infrequent labels, as described in the Grouping rare or infrequent categories recipe.

The encoder returns pandas DataFrames with the strings of the categorical variables replaced with the counts of observations, leaving the variables ready to use in machine learning models.

To wrap up this recipe, let’s encode the variables using Category Encoders.

  9. Let’s import the encoder from Category Encoders:
    from category_encoders.count import CountEncoder
  10. Let’s set up the encoder so that it encodes all categorical variables with the count of observations:
    count_enc = CountEncoder(cols=None)

Note

CountEncoder() automatically finds and encodes all categorical variables in the train set. To encode only a subset of the categorical variables, we can pass the variable names in a list to the cols argument. To replace the categories with frequencies instead, we need to set the normalize parameter to True.

  11. Let’s fit the encoder to the train set so that it counts and stores the number of observations per category per variable:
    count_enc.fit(X_train)

Tip

The values used to replace the categories are stored in the mapping attribute and can be displayed by executing count_enc.mapping.

  12. Finally, let’s replace the categories with counts in the train and test sets:
    X_train_enc = count_enc.transform(X_train)
    X_test_enc = count_enc.transform(X_test)

Note

Categories present in the test set that were not seen in the train set are referred to as unknown categories. CountEncoder() has different options to handle unknown categories, including returning an error, treating them as missing data, or replacing them with an indicated integer. CountEncoder() can also automatically group categories with few observations.

The encoder returns pandas DataFrames with the strings of the categorical variables replaced with the counts of observations, leaving the variables ready to use in machine learning models.
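To build some intuition for how grouping categories with few observations also takes care of unknown categories, here is a pandas-only sketch on made-up data. The threshold min_count, the label "Rare", and the helper group_rare() are hypothetical names chosen for this example. This is not how CountEncoder() implements the behavior internally.

```python
import pandas as pd

# Hypothetical toy data; "q" in the test set is an unknown category.
train = pd.Series(["v"] * 6 + ["h"] * 3 + ["j", "o"], name="A7")
test = pd.Series(["v", "o", "q"], name="A7")

min_count = 2  # assumed threshold below which a category counts as rare

# Categories with at least min_count observations in the train set.
counts = train.value_counts()
frequent = counts[counts >= min_count].index


def group_rare(s):
    """Replace every category outside `frequent` with the label 'Rare'."""
    return s.where(s.isin(frequent), "Rare")


train_grouped = group_rare(train)
test_grouped = group_rare(test)

# Count encoding learned from the grouped train set.
count_map = train_grouped.value_counts().to_dict()
train_enc = train_grouped.map(count_map)
test_enc = test_grouped.map(count_map)

print(count_map, test_enc.tolist())
```

Because the unknown category q falls into the Rare group, it receives a count instead of a missing value. Note that this only works if at least one category in the train set is rare; otherwise, the Rare label will have no entry in the dictionary.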

How it works...

In this recipe, we replaced categories by the count of observations using pandas, Feature-engine, and Category Encoders.

Using pandas value_counts(), we determined the number of observations per category of the A7 variable, and with pandas to_dict(), we captured these values in a dictionary, where each key was a unique category, and each value the number of observations for that category. With pandas map() and using this dictionary, we replaced the categories with the observation counts in both the train and test sets.

To perform count encoding with Feature-engine, we used CountFrequencyEncoder() and set encoding_method to 'count'. We left the variables argument set to None so that the encoder automatically finds all of the categorical variables in the dataset. With the fit() method, the transformer found the categorical variables and stored the observation counts per category in the encoder_dict_ attribute. With the transform() method, the transformer replaced the categories with the counts, returning a pandas DataFrame.
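The fit() and transform() logic described above can be sketched with a few lines of pandas. The class below, SimpleCountEncoder, is a hypothetical and heavily simplified stand-in written for illustration. It is not Feature-engine's actual implementation and omits its input checks and warnings.

```python
import pandas as pd


class SimpleCountEncoder:
    """Toy count encoder that mimics the fit/transform pattern."""

    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X):
        # With variables=None, pick up object and categorical columns.
        if self.variables is None:
            cat_cols = X.select_dtypes(include=["object", "category"]).columns
            self.variables_ = list(cat_cols)
        else:
            self.variables_ = list(self.variables)
        # Store one category-to-count dictionary per variable.
        self.encoder_dict_ = {
            var: X[var].value_counts().to_dict() for var in self.variables_
        }
        return self

    def transform(self, X):
        # Work on a copy so the input DataFrame is not modified.
        X = X.copy()
        for var in self.variables_:
            X[var] = X[var].map(self.encoder_dict_[var])
        return X


# Hypothetical toy data: one categorical and one numerical column.
X_toy = pd.DataFrame(
    {
        "A7": pd.Series(["v", "v", "h"], dtype="object"),
        "A2": [30.8, 58.7, 24.5],
    }
)

toy_enc = SimpleCountEncoder().fit(X_toy)
X_toy_enc = toy_enc.transform(X_toy)
print(toy_enc.encoder_dict_)
```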

Finally, we performed count encoding with CountEncoder(), leaving its normalize parameter at the default value of False. We left the cols argument set to None so that the encoder automatically finds the categorical variables in the dataset. With the fit() method, the transformer found the categorical variables and stored the category-to-count mappings in the mapping attribute. With the transform() method, the transformer replaced the categories with the counts, returning a pandas DataFrame.
