Python Feature Engineering Cookbook

You're reading from Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type: Paperback
Published in: Oct 2022
Publisher: Packt
ISBN-13: 9781804611302
Length: 386 pages
Edition: 2nd Edition
Author: Soledad Galli

Table of Contents (14)

Preface
Chapter 1: Imputing Missing Data
Chapter 2: Encoding Categorical Variables (free chapter)
Chapter 3: Transforming Numerical Variables
Chapter 4: Performing Variable Discretization
Chapter 5: Working with Outliers
Chapter 6: Extracting Features from Date and Time Variables
Chapter 7: Performing Feature Scaling
Chapter 8: Creating New Features
Chapter 9: Extracting Features from Relational Data with Featuretools
Chapter 10: Creating Features from a Time Series with tsfresh
Chapter 11: Extracting Features from Text Variables
Index
Other Books You May Enjoy

Replacing categories with ordinal numbers

Ordinal encoding consists of replacing the categories with integers from 1 to k (or 0 to k-1, depending on the implementation), where k is the number of distinct categories of the variable. The integers are assigned arbitrarily. Ordinal encoding is better suited to non-linear machine learning models, which can navigate the arbitrarily assigned integers to find patterns that relate them to the target.
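The idea can be sketched in a few lines with a toy variable (illustrative data, not the recipe's dataset):

```python
import pandas as pd

# A toy categorical variable with k = 3 distinct categories
s = pd.Series(["red", "blue", "red", "green", "blue"])

# Assign an arbitrary integer, 0 to k-1, to each category
# in order of appearance
mapping = {cat: i for i, cat in enumerate(s.unique())}
encoded = s.map(mapping)

print(mapping)            # {'red': 0, 'blue': 1, 'green': 2}
print(encoded.tolist())   # [0, 1, 0, 2, 1]
```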

In this recipe, we will perform ordinal encoding using pandas, scikit-learn, and Feature-engine.

How to do it...

First, let’s import the necessary Python libraries and prepare the dataset:

  1. Import pandas and the data split function:
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load the dataset and divide it into train and test sets:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(labels=["target"], axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. To encode the A7 variable, let’s make a dictionary of category-to-integer pairs:
    ordinal_mapping = {k: i for i, k in enumerate(
        X_train["A7"].unique(), 0)
    }

If we execute print(ordinal_mapping), we will see the digits that will replace each category:

{'v': 0, 'ff': 1, 'h': 2, 'dd': 3, 'z': 4, 'bb': 5, 'j': 6, 'Missing': 7, 'n': 8, 'o': 9}
  4. Now, let’s replace the categories with numbers in the original variables:
    X_train["A7"] = X_train["A7"].map(ordinal_mapping)
    X_test["A7"] = X_test["A7"].map(ordinal_mapping)

With print(X_train["A7"].head(10)), we can see the result of the preceding operation, where the original categories were replaced by numbers:

596     0
303     0
204     0
351     1
118     0
247     2
652     0
513     3
230     0
250     4
Name: A7, dtype: int64
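Because the mapping dictionary is explicit, reversing it lets you decode the integers back into the original categories; here is a minimal sketch using a hypothetical three-category mapping (not the full mapping shown above, which works the same way):

```python
import pandas as pd

# Hypothetical category-to-integer mapping, for illustration
ordinal_mapping = {"v": 0, "ff": 1, "h": 2}

# Invert it to recover the original labels from the codes
inverse_mapping = {i: cat for cat, i in ordinal_mapping.items()}

encoded = pd.Series([0, 2, 1, 0])
decoded = encoded.map(inverse_mapping)
print(decoded.tolist())   # ['v', 'h', 'ff', 'v']
```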

Next, let’s carry out ordinal encoding using scikit-learn. First, we need to divide the data into train and test sets, as we did in step 2.

  5. Let’s import the required classes:
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.compose import ColumnTransformer

Tip

Do not confuse OrdinalEncoder() with LabelEncoder() from scikit-learn. The former is intended to encode predictive features, whereas the latter is intended to modify the target variable.
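A toy example (illustrative data) makes the distinction concrete: OrdinalEncoder() expects a 2D feature matrix, while LabelEncoder() operates on a 1D target array:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

X = np.array([["a"], ["c"], ["b"], ["a"]])   # 2D: one feature column
y = np.array(["yes", "no", "yes"])           # 1D: the target

# OrdinalEncoder works on 2D feature matrices;
# by default it sorts the categories before assigning integers
enc = OrdinalEncoder()
print(enc.fit_transform(X).ravel())   # [0. 2. 1. 0.]

# LabelEncoder works on the 1D target
le = LabelEncoder()
print(le.fit_transform(y))            # [1 0 1]
```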

  6. Let’s set up the encoder:
    enc = OrdinalEncoder()

Note

Scikit-learn’s OrdinalEncoder() will encode the entire dataset. To encode only a selection of variables, we need to use scikit-learn’s ColumnTransformer().

  7. Let’s make a list containing the categorical variables to encode:
    vars_categorical = X_train.select_dtypes(
        include="O").columns.to_list()
  8. Let’s make a list containing the remaining variables:
    vars_remainder = X_train.select_dtypes(
        exclude="O").columns.to_list()
  9. Now, let’s set up ColumnTransformer() to encode the categorical variables. By setting the remainder parameter to "passthrough", we make ColumnTransformer() concatenate the variables that are not encoded at the back of the encoded features:
    ct = ColumnTransformer(
        [("encoder", enc, vars_categorical)],
        remainder="passthrough",
    )
  10. Let’s fit the encoder to the train set so that it creates and stores representations of categories to digits:
    ct.fit(X_train)

By executing ct.named_transformers_["encoder"].categories_, you can visualize the unique categories per variable.

  11. Now, let’s encode the categorical variables in the train and test sets:
    X_train_enc = ct.transform(X_train)
    X_test_enc = ct.transform(X_test)

Remember that scikit-learn returns a NumPy array.

  12. Let’s transform the arrays into pandas DataFrames by specifying the column names:
    X_train_enc = pd.DataFrame(
        X_train_enc, columns=vars_categorical+vars_remainder)
    X_test_enc = pd.DataFrame(
        X_test_enc, columns=vars_categorical+vars_remainder)

Note

Note that, with ColumnTransformer(), the variables that were not encoded will be returned to the right of the DataFrame, following the encoded variables. You can visualize the output of step 12 with X_train_enc.head().

Now, let’s do ordinal encoding with Feature-engine. First, we must divide the dataset, as we did in step 2.

  13. Let’s import the encoder:
    from feature_engine.encoding import OrdinalEncoder
  14. Let’s set up the encoder so that it replaces categories with arbitrary integers in the categorical variables specified in step 7:
    enc = OrdinalEncoder(encoding_method="arbitrary", variables=vars_categorical)

Note

Feature-engine’s OrdinalEncoder() automatically finds and encodes all categorical variables if the variables parameter is left set to None. Alternatively, it will encode only the variables indicated in the list. In addition, Feature-engine’s OrdinalEncoder() can assign the integers according to the target mean value (see the Performing ordinal encoding based on the target value recipe).

  15. Let’s fit the encoder to the train set so that it learns and stores the category-to-integer mappings:
    enc.fit(X_train)

Tip

The category-to-integer mappings are stored in the encoder_dict_ attribute and can be accessed by executing enc.encoder_dict_.

  16. Finally, let’s encode the categorical variables in the train and test sets:
    X_train_enc = enc.transform(X_train)
    X_test_enc = enc.transform(X_test)

Feature-engine returns pandas DataFrames where the values of the original variables are replaced with numbers, leaving the DataFrame ready to use in machine learning models.

How it works...

In this recipe, we replaced categories with integers assigned arbitrarily.

With pandas unique(), we returned the unique values of the A7 variable, and using Python’s list comprehension syntax, we created a dictionary of key-value pairs, where each key was one of the A7 variable’s unique categories, and each value was the digit that would replace the category. Finally, we used pandas map() to replace the strings in A7 with the integers.
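pandas also offers pd.factorize(), which builds the same order-of-appearance mapping in one call; here is a sketch with illustrative data, not the recipe's dataset:

```python
import pandas as pd

s = pd.Series(["v", "ff", "h", "v", "ff"])

# factorize() assigns integers in order of appearance,
# like the recipe's dictionary comprehension
codes, uniques = pd.factorize(s)
print(codes.tolist())     # [0, 1, 2, 0, 1]
print(uniques.tolist())   # ['v', 'ff', 'h']
```

Note that factorize() learns the mapping from whatever data it is given, so to keep train and test sets consistent you would still store the train-derived mapping (for example, as a dictionary built from uniques) and apply it to the test set.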

Next, we carried out ordinal encoding using scikit-learn’s OrdinalEncoder() and used ColumnTransformer() to select the columns to encode. With the fit() method, the transformer created the category-to-integer mappings based on the categories in the train set. With the transform() method, the categories were replaced with integers, returning a NumPy array. ColumnTransformer() sliced the DataFrame into the categorical variables to encode, and then concatenated the remaining variables at the right of the encoded features.

To perform ordinal encoding with Feature-engine, we used OrdinalEncoder(), indicating that the integers should be assigned arbitrarily in encoding_method and passing a list with the variables to encode in the variables argument. With the fit() method, the encoder assigned integers to each variable’s categories, which were stored in the encoder_dict_ attribute. These mappings were then used by the transform() method to replace the categories in the train and test sets, returning DataFrames.

There’s more...

You can also carry out ordinal encoding with OrdinalEncoder() from Category Encoders.

The transformers from Feature-engine and Category Encoders can automatically identify and encode categorical variables – that is, those of the object or categorical type. They also allow us to encode only a subset of the variables.

scikit-learn’s transformer, in contrast, encodes all variables in the dataset. To encode just a subset, we need an additional class, ColumnTransformer(), to slice the data before the transformation.

Feature-engine and Category Encoders return pandas DataFrames, whereas scikit-learn returns NumPy arrays.

Finally, each class has additional functionality. For example, with scikit-learn, we can encode only a subset of the categories, whereas Feature-engine allows us to replace categories with integers that are assigned based on the target mean value. On the other hand, Category Encoders can automatically handle missing data and offers alternative options to work with unseen categories.
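
scikit-learn's OrdinalEncoder() can also cope with unseen categories through its handle_unknown and unknown_value parameters (available since scikit-learn 0.24); here is a minimal sketch with illustrative data:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X_train = np.array([["a"], ["b"], ["c"]])
X_test = np.array([["b"], ["d"]])   # "d" was not seen during fit

# Unseen categories are mapped to a sentinel value instead
# of raising an error at transform time
enc = OrdinalEncoder(handle_unknown="use_encoded_value",
                     unknown_value=-1)
enc.fit(X_train)
print(enc.transform(X_test).ravel())   # [ 1. -1.]
```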
