Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type Paperback
Published in Oct 2022
Publisher Packt
ISBN-13 9781804611302
Length 386 pages
Edition 2nd Edition
Author: Soledad Galli

Creating binary variables through one-hot encoding

In one-hot encoding, we represent a categorical variable as a group of binary variables, where each binary variable represents one category. The binary variable takes a value of 1 if the category is present in an observation, or 0 otherwise.

The following table shows the one-hot encoded representation of the Gender variable with the categories of Male and Female:

Figure 2.2 – One-hot encoded representation of the Gender variable


As shown in Figure 2.2, from the Gender variable, we can derive the binary variable of Female, which shows the value of 1 for females, or the binary variable of Male, which takes the value of 1 for the males in the dataset.

For the categorical variable of Color with the values of red, blue, and green, we can create three variables called red, blue, and green. These variables will take the value of 1 if the observation is red, blue, or green, respectively, or 0 otherwise.

A categorical variable with k unique categories can be encoded using k-1 binary variables. For Gender, k is 2 as it contains two labels (male and female), so we only need to create one binary variable (k - 1 = 1) to capture all of the information. For the Color variable, which has three categories (k=3; red, blue, and green), we need to create two (k - 1 = 2) binary variables to capture all the information so that the following occurs:

  • If the observation is red, it will be captured by the red variable (red = 1, blue = 0).
  • If the observation is blue, it will be captured by the blue variable (red = 0, blue = 1).
  • If the observation is green, it will be captured by the combination of red and blue (red = 0, blue = 0).
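The Color example above can be sketched with pandas on a toy DataFrame. The data and the variable names (colors, k_dummies, k_minus_1) are made up for illustration and are not part of the recipe's dataset:

```python
import pandas as pd

# Toy data, invented for this illustration.
colors = pd.DataFrame({"Color": ["red", "blue", "green", "red"]})

# k = 3 dummies: one binary column per category.
k_dummies = pd.get_dummies(colors["Color"], dtype=int)

# k - 1 = 2 dummies: keep red and blue, so green is the row where both are 0.
k_minus_1 = k_dummies[["red", "blue"]]
print(k_minus_1)
```

Here the green observation (third row) is identified by red = 0 and blue = 0, so no information is lost by leaving out the green column.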

Encoding into k-1 binary variables is well-suited for linear models. There are a few occasions in which we may prefer to encode the categorical variables with k binary variables:

  • When training decision trees since they do not evaluate the entire feature space at the same time
  • When selecting features recursively
  • When determining the importance of each category within a variable
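One way to see why k-1 binary variables suit linear models is that the k dummies always add up to 1, which makes them redundant next to an intercept. The following is a minimal sketch with NumPy on invented data. The rank check is an illustration chosen here, not a step from the recipe:

```python
import numpy as np
import pandas as pd

# Toy data, invented for this illustration.
colors = pd.Series(["red", "blue", "green", "red", "green", "blue"])

k_dummies = pd.get_dummies(colors, dtype=float)
k_minus_1 = pd.get_dummies(colors, drop_first=True, dtype=float)

# Design matrices with an intercept column, as a linear model would use.
intercept = np.ones((len(colors), 1))
X_k = np.hstack([intercept, k_dummies.to_numpy()])
X_k_minus_1 = np.hstack([intercept, k_minus_1.to_numpy()])

# With k dummies, the columns are linearly dependent (rank < number of columns).
rank_k = np.linalg.matrix_rank(X_k)
# With k-1 dummies, every column carries independent information.
rank_k_minus_1 = np.linalg.matrix_rank(X_k_minus_1)

print(X_k.shape[1], rank_k)
print(X_k_minus_1.shape[1], rank_k_minus_1)
```

Tree-based models do not rely on a full-rank design matrix, which is consistent with the preference for k dummies in the cases listed above.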

In this recipe, we will compare the one-hot encoding implementations of pandas, scikit-learn, Feature-engine, and Category Encoders.

How to do it...

First, let’s make a few imports and get the data ready:

  1. Import pandas and the train_test_split function from scikit-learn:
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load the Credit Approval dataset:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s separate the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s inspect the unique categories of the A4 variable:
    X_train["A4"].unique()

We can see the unique values of A4 in the following output:

array(['u', 'y', 'Missing', 'l'], dtype=object)
  5. Let’s encode A4 into k-1 binary variables using pandas and then inspect the first five rows of the resulting DataFrame:
    dummies = pd.get_dummies(X_train["A4"], drop_first=True)
    dummies.head()

Note

With pandas get_dummies(), we can either ignore or encode missing data through the dummy_na parameter. By setting dummy_na=True, missing data will be encoded in a new binary variable. To encode the variable into k dummies, use drop_first=False instead.

Here, we can see the output of step 5, where each label is now a binary variable:

	l	u	y
596	0	1	0
303	0	1	0
204	0	0	1
351	0	0	1
118	0	1	0
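The effect of the dummy_na parameter mentioned in the note on get_dummies() can be sketched on a small Series with one missing value. The data is invented for illustration, and dtype=int is passed only so that the output shows 0 and 1 regardless of the pandas version:

```python
import numpy as np
import pandas as pd

# Toy variable with one missing value, invented for this illustration.
a4 = pd.Series(["u", "y", np.nan, "u"], name="A4")

# Default behavior: the missing value is ignored, so its row is all zeros.
ignored = pd.get_dummies(a4, dtype=int)

# dummy_na=True: the missing value gets its own binary variable.
encoded = pd.get_dummies(a4, dummy_na=True, dtype=int)

print(ignored)
print(encoded)
```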
  6. Now, let’s encode all of the categorical variables into k-1 binaries, capturing the result in a new DataFrame:
    X_train_enc = pd.get_dummies(X_train, drop_first=True)
    X_test_enc = pd.get_dummies(X_test, drop_first=True)

Note

The get_dummies() method from pandas will automatically encode all variables of the object or categorical type. We can encode a subset of the variables by passing the variable names in a list to the columns parameter.
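Encoding only a subset of the variables through the columns parameter might look like the following sketch. The small DataFrame is invented for illustration and only borrows variable names from the Credit Approval dataset:

```python
import pandas as pd

# Toy data, invented for this illustration.
df = pd.DataFrame({
    "A1": ["a", "b", "a"],
    "A4": ["u", "y", "u"],
    "A2": [30.8, 58.7, 24.5],
})

# Encode A1 only; A4 stays categorical and A2 stays numerical.
subset = pd.get_dummies(df, columns=["A1"], drop_first=True, dtype=int)
print(subset)
```

The resulting binary variable is named A1_b, that is, the variable name, an underscore, and the category.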

  7. Let’s inspect the first five rows of the binary variables created in step 6:
    X_train_enc.head()

Note

When encoding more than one variable, get_dummies() captures the variable name – say, A1 – and places an underscore followed by the category name to identify the resulting binary variables.

We can see the binary variables in the following output:

Figure 2.3 – Transformed DataFrame showing the dummy variables on the right


Note

The get_dummies() method will create one binary variable per seen category. Hence, if there are more categories in the train set than in the test set, get_dummies() will return more columns in the transformed train set than in the transformed test set, and vice versa. To avoid this, it is better to carry out one-hot encoding with scikit-learn or Feature-engine, as we will discuss later in this recipe.
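The column mismatch described in this note can be reproduced on toy data. The reindex() call at the end is one possible workaround assumed here for illustration. It is not part of the recipe, which recommends scikit-learn or Feature-engine instead:

```python
import pandas as pd

# Toy train and test sets, invented for this illustration.
# The category "l" appears in the train set only.
train = pd.DataFrame({"A4": ["u", "y", "l", "u"]})
test = pd.DataFrame({"A4": ["u", "y", "u"]})

train_enc = pd.get_dummies(train, dtype=int)
test_enc = pd.get_dummies(test, dtype=int)

# The two encoded sets have different columns.
print(list(train_enc.columns))
print(list(test_enc.columns))

# Possible workaround (an assumption, not from the recipe): align the test
# columns to the train columns, filling the missing dummies with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```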

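  8. Let’s concatenate the binary variables to the original dataset. Note that get_dummies() applied to the whole DataFrame in step 6 already returns the numerical variables alongside the dummies, so this step and the next are only needed when the binary variables are created separately from the rest of the data, as we do later with scikit-learn: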
    X_test_enc = pd.concat([X_test, X_test_enc], axis=1)
  9. Now, let’s drop the categorical variables from the data:
    X_test_enc.drop(
        labels=X_test_enc.select_dtypes(
            include="O").columns,
        axis=1,
        inplace=True,
    )

And that’s it! Now, we can use our categorical variables to train mathematical models. To inspect the result, use X_test_enc.head().

Now, let’s do one-hot encoding using scikit-learn.

  10. Import the encoder from scikit-learn:
    from sklearn.preprocessing import OneHotEncoder
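  11. Let’s set up the transformer. By setting drop to "first", we encode into k-1 binary variables, and by setting sparse to False, the transformer will return a NumPy array instead of a sparse matrix (in scikit-learn 1.2 and later, this parameter is named sparse_output, and sparse was removed in version 1.4):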
    encoder = OneHotEncoder(drop="first", sparse=False)

Tip

We can encode variables into k dummies by setting the drop parameter to None. We can also encode into k-1 if variables contain two categories and into k if variables contain more than two categories by setting the drop parameter to "if_binary". The latter is useful because encoding binary variables into k dummies is redundant.

  12. First, let’s create a list containing the variable names:
    vars_categorical = X_train.select_dtypes(
        include="O").columns.to_list()
  13. Let’s fit the encoder to a slice of the train set with the categorical variables:
    encoder.fit(X_train[vars_categorical])
  14. Let’s inspect the categories for which dummy variables will be created:
    encoder.categories_

We can see the result of the preceding command here:

Figure 2.4 – Arrays with the categories that will be encoded into binary variables, one array per variable


Note

Scikit-learn’s OneHotEncoder() will only encode the categories learned from the train set. If there are new categories in the test set, we can instruct the encoder to ignore them or to return an error by setting the handle_unknown parameter to 'ignore' or 'error', respectively.
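The two handle_unknown settings can be compared on toy data in which the test set contains a category that is absent from the train set. The DataFrames and the variable names (lenient, strict, raised) are invented for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for this illustration; "l" is unseen during fit.
train = pd.DataFrame({"A4": ["u", "y", "u"]})
test = pd.DataFrame({"A4": ["u", "l"]})

# 'ignore': the unseen category is encoded as all zeros.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
ignored = lenient.transform(test).toarray()
print(ignored)

# 'error' (the default): transforming the unseen category raises a ValueError.
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True
print(raised)
```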

  15. Let’s create the NumPy arrays with the binary variables for the train and test sets:
    X_train_enc = encoder.transform(
        X_train[vars_categorical])
    X_test_enc = encoder.transform(
        X_test[vars_categorical])
  16. Let’s extract the names of the binary variables:
    encoder.get_feature_names_out()

We can see the binary variable names that were returned in the following output:

Figure 2.5 – Arrays with the names of the one-hot encoded variables


  17. Let’s convert the array into a pandas DataFrame and add the variable names:
    X_test_enc = pd.DataFrame(X_test_enc)
    X_test_enc.columns = encoder.get_feature_names_out()
  18. To concatenate the one-hot encoded data to the original dataset, we need to make their indexes match:
    X_test_enc.index = X_test.index

Now, we are ready to concatenate the one-hot encoded variables to the original data and then remove the categorical variables using steps 8 and 9 from this recipe.

To follow up, let’s perform one-hot encoding with Feature-engine.

  19. Let’s import the encoder from Feature-engine:
    from feature_engine.encoding import OneHotEncoder
  20. Next, let’s set up the encoder so that it returns k-1 binary variables:
    ohe_enc = OneHotEncoder(drop_last=True)

Tip

Feature-engine automatically finds the categorical variables. To encode only a subset of the variables, we can pass the variable names in a list: OneHotEncoder(variables=["A1", "A4"]). To encode numerical variables, we can set the ignore_format parameter to True or cast the variables as the object type. This is useful because sometimes, numerical variables are used to represent categories, such as postcodes.

  21. Let’s fit the encoder to the train set so that it learns the categories and variables to encode:
    ohe_enc.fit(X_train)
  22. Let’s explore the variables that will be encoded:
    ohe_enc.variables_

The transformer found and stored the variables of the object or categorical type, as shown in the following output:

['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']

Note

Feature-engine’s OneHotEncoder has the option to encode most variables into k dummies, while only returning k-1 dummies for binary variables. For this behavior, set the drop_last_binary parameter to True.

  23. Let’s explore the categories for which dummy variables will be created:
    ohe_enc.encoder_dict_

The following dictionary contains the categories that will be encoded in each variable:

{'A1': ['a', 'b'],
 'A4': ['u', 'y', 'Missing'],
 'A5': ['g', 'p', 'Missing'],
 'A6': ['c', 'q', 'w', 'ff', 'm', 'i', 'e', 'cc', 'x', 'd',
        'k', 'j', 'Missing', 'aa'],
 'A7': ['v', 'ff', 'h', 'dd', 'z', 'bb', 'j', 'Missing', 'n'],
 'A9': ['t'],
 'A10': ['t'],
 'A12': ['t'],
 'A13': ['g', 's']}
  24. Let’s encode the categorical variables in train and test sets:
    X_train_enc = ohe_enc.transform(X_train)
    X_test_enc = ohe_enc.transform(X_test)

Tip

Feature-engine’s OneHotEncoder() returns a copy of the original dataset plus the binary variables and without the original categorical variables. Thus, this data is ready to train machine learning models.

If we execute X_train_enc.head(), we will see the following DataFrame:

Figure 2.6 – Transformed DataFrame with the one-hot encoded variables on the right


Note how the A4 categorical variable was replaced with A4_u, A4_y, and so on.

Note

We can get the names of all the variables in the transformed dataset by executing ohe_enc.get_feature_names_out().

How it works...

In this recipe, we performed one-hot encoding of categorical variables using pandas, scikit-learn, and Feature-engine.

With get_dummies() from pandas, we automatically created binary variables for each of the categories in the categorical variables.

The OneHotEncoder transformers from the scikit-learn and Feature-engine libraries share the fit() and transform() methods. With fit(), the encoders learned the categories for which the dummy variables should be created. With transform(), they either returned the binary variables in a NumPy array (scikit-learn) or added them to the original DataFrame (Feature-engine).

Tip

One-hot encoding expands the feature space. From nine original categorical variables, we created 36 binary ones. If our datasets contain many categorical variables or high-cardinality variables, the feature space will grow dramatically, which increases the computational cost of training machine learning models or obtaining their predictions and may also degrade their performance.
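The growth of the feature space can be estimated before encoding by counting the unique categories per variable: k dummies per variable for a full encoding, or k-1 when one category is dropped. A minimal sketch on invented data:

```python
import pandas as pd

# Toy data, invented for this illustration: 2, 3, and 4 categories.
df = pd.DataFrame({
    "A1": ["a", "b", "a", "b"],
    "A4": ["u", "y", "l", "u"],
    "A6": ["c", "q", "w", "ff"],
})

# Expected number of binary variables, computed from the cardinalities.
n_k = df.nunique().sum()
n_k_minus_1 = (df.nunique() - 1).sum()

# Actual number of columns produced by get_dummies().
full = pd.get_dummies(df, dtype=int)
reduced = pd.get_dummies(df, drop_first=True, dtype=int)

print(n_k, full.shape[1])
print(n_k_minus_1, reduced.shape[1])
```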

There’s more...

We can also perform one-hot encoding using OneHotEncoder from the Category Encoders library.

OneHotEncoder() from Feature-engine and Category Encoders can automatically identify and encode categorical variables – that is, those of the object or categorical type. So does pandas get_dummies(). Scikit-learn’s OneHotEncoder(), on the other hand, will encode all variables in the dataset.

With pandas, Feature-engine, and Category Encoders, we can encode only a subset of the variables by passing their names in a list. With scikit-learn, we need to use an additional class, ColumnTransformer(), to slice the data before the transformation.

With Feature-engine and Category Encoders, the dummy variables are added to the original dataset and the categorical variables are removed after the encoding. With scikit-learn and pandas, we need to manually perform these procedures.

Finally, using OneHotEncoder() from scikit-learn, Feature-engine, and Category Encoders, we can perform the encoding step within a scikit-learn pipeline, which is more convenient if we have various feature engineering steps or want to put the pipelines into production. pandas get_dummies(), on the other hand, is well suited for data analysis and visualization.
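As an illustration of encoding within a pipeline, the sketch below chains scikit-learn's OneHotEncoder() with a logistic regression. The data, the target values, the step names, and the choice of classifier are all invented for this example:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data and target, invented for this illustration.
X_train = pd.DataFrame({"A4": ["u", "y", "u", "y", "l", "u"]})
y_train = pd.Series([1, 0, 1, 0, 0, 1])
X_new = pd.DataFrame({"A4": ["y", "u", "z"]})  # "z" was not seen during fit

pipe = Pipeline([
    # Unseen categories are encoded as all zeros instead of raising an error.
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ("clf", LogisticRegression()),
])

# fit() learns the categories and trains the model in a single call.
pipe.fit(X_train, y_train)
preds = pipe.predict(X_new)
print(preds)
```

Because the categories are learned during fit(), the same fitted pipeline can be applied to new data without repeating the encoding steps by hand.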
