You're reading from Python Feature Engineering Cookbook A complete guide to crafting powerful features for your machine learning models

Product type Paperback

Published in Aug 2024

Publisher Packt

ISBN-13 9781835883587

Length 396 pages

Edition 3rd Edition

Languages

Python

Tools

Combine

Concepts

Data Science

Author (1):

Soledad Galli

View More author details

Table of Contents (14) Chapters

Preface

1. Chapter 1: Imputing Missing Data FREE CHAPTER

2. Chapter 2: Encoding Categorical Variables

3. Chapter 3: Transforming Numerical Variables

4. Chapter 4: Performing Variable Discretization

5. Chapter 5: Working with Outliers

6. Chapter 6: Extracting Features from Date and Time Variables

7. Chapter 7: Performing Feature Scaling

8. Chapter 8: Creating New Features

9. Chapter 9: Extracting Features from Relational Data with Featuretools

10. Chapter 10: Creating Features from a Time Series with tsfresh

11. Chapter 11: Extracting Features from Text Variables

12. Index

Why subscribe?

13. Other Books You May Enjoy

Replacing missing values with an arbitrary number

We can replace missing data with an arbitrary value. Commonly used values are 999, 9999, or -1 for positive distributions. This method is used for numerical variables. For categorical variables, the equivalent method is to replace missing data with an arbitrary string, as described in the Imputing categorical variables recipe.

When replacing missing values with arbitrary numbers, we need to be careful not to select a value close to the mean, the median, or any other common value of the distribution.

Note

We’d use arbitrary number imputation when data is not missing at random, use non-linear models, or when the percentage of missing data is high. This imputation technique distorts the original variable distribution.

In this recipe, we will impute missing data with arbitrary numbers using pandas, scikit-learn, and feature-engine.

How to do it...

Let’s begin by importing the necessary tools and loading the data:

Import pandas and the required functions and classes:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.imputation import ArbitraryNumberImputer

Let’s load the dataset described in the Technical requirements section:
```
data = pd.read_csv("credit_approval_uci.csv")
```

Let’s separate the data into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)

We will select arbitrary values greater than the maximum value of the distribution.

Let’s find the maximum value of four numerical variables:
```
X_train[['A2','A3', 'A8', 'A11']].max()
```
The previous command returns the following output:
```
A2     76.750
A3     26.335
A8     28.500
A11    67.000
dtype: float64
```
We’ll use 99 for the imputation because it is bigger than the maximum values of the numerical variables in step 4.

Let’s make a copy of the original DataFrames:

X_train_t = X_train.copy()
X_test_t = X_test.copy()

Now, we replace the missing values with 99:

X_train_t[["A2", "A3", "A8", "A11"]] = X_train_t[[
    "A2", "A3", "A8", "A11"]].fillna(99)
X_test_t[["A2", "A3", "A8", "A11"]] = X_test_t[[
    "A2", "A3", "A8", "A11"]].fillna(99)

Note

To impute different variables with different values using pandas fillna(), use a dictionary like this: imputation_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}.

Now, we’ll impute missing values with an arbitrary number using scikit-learn.

Let’s set up imputer to replace missing values with 99:

imputer = SimpleImputer(strategy='constant', fill_value=99)

Note

If your dataset contains categorical variables, SimpleImputer() will add 99 to those variables as well if any values are missing.

Let’s fit imputer to a slice of the train set containing the variables to impute:
```
vars = ["A2", "A3", "A8", "A11"]
imputer.fit(X_train[vars])
```
Replace the missing values with 99 in the desired variables:
```
X_train_t[vars] = imputer.transform(X_train[vars])
X_test_t[vars] = imputer.transform(X_test[vars])
```
Go ahead and check the lack of missing values by executing X_test_t[["A2", "A3", "A8", "A11"]].isnull().sum().
To finish, let’s impute missing values using feature-engine.

Let’s set up the imputer to replace missing values with 99 in 4 specific variables:

imputer = ArbitraryNumberImputer(
    arbitrary_number=99,
    variables=["A2", "A3", "A8", "A11"],
)

Note

ArbitraryNumberImputer() will automatically select all numerical variables in the train set for imputation if we set the variables parameter to None.

Finally, let’s replace the missing values with 99:

X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

Note

To impute different variables with different numbers, set up ArbitraryNumberImputer() as follows: ArbitraryNumberImputer(imputater_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}).

We have now replaced missing data with arbitrary numbers using three different open-source libraries.

How it works...

In this recipe, we replaced missing values in numerical variables with an arbitrary number using pandas, scikit-learn, and feature-engine.

To determine which arbitrary value to use, we inspected the maximum values of four numerical variables using pandas’ max(). We chose 99 because it was greater than the maximum values of the selected variables. In step 5, we used pandas fillna() to replace the missing data.

To replace missing values using scikit-learn, we utilized SimpleImputer(), with the strategy set to constant, and specified 99 in the fill_value argument. Next, we fitted the imputer to a slice of the train set with the numerical variables to impute. Finally, we replaced missing values using transform().

To replace missing values with feature-engine we used ArbitraryValueImputer(), specifying the value 99 and the variables to impute as parameters. Next, we applied the fit_transform() method to replace missing data in the train set and the transform() method to replace missing data in the test set.

You're reading from Python Feature Engineering Cookbook A complete guide to crafting powerful features for your machine learning models

Table of Contents (14) Chapters

Replacing missing values with an arbitrary number

How to do it...

How it works...

Authors (1)

Personalised recommendations for you