Removing observations with missing data
Complete Case Analysis (CCA), also called list-wise deletion of cases, consists of discarding observations with missing data. CCA can be applied to both categorical and numerical variables. CCA preserves the distribution of the variables after removing the incomplete observations, provided the data is missing at random and only in a small proportion of observations. However, if data is missing across many variables, CCA may lead to the removal of a large portion of the dataset.
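As a quick illustration of the idea (a minimal sketch using a toy DataFrame with made-up column names, not the credit approval dataset used below), list-wise deletion keeps only the rows with no missing values:

```python
import numpy as np
import pandas as pd

# Toy data: two of the four rows contain a missing value
df = pd.DataFrame({
    "A1": [1.0, 2.0, np.nan, 4.0],
    "A2": ["a", None, "c", "d"],
})

complete_cases = df.dropna()  # list-wise deletion

print(len(df))              # 4
print(len(complete_cases))  # 2
```

Half the rows are discarded even though only two individual values were missing, which is why CCA becomes costly when many variables have gaps.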
Note
Use CCA only when a small number of observations are missing and you have good reasons to believe that they are not important to your model.
How to do it...
Let’s begin by making some imports and loading the dataset:
- Let’s import `pandas`, `matplotlib`, and the train/test split function from scikit-learn:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
```
- Let’s load and display the dataset described in the Technical requirements section:
```python
data = pd.read_csv("credit_approval_uci.csv")
data.head()
```
In the following image, we see the first 5 rows of data:
Figure 1.1 – First 5 rows of the dataset
- Let’s proceed as we normally would if we were preparing the data to train machine learning models; by splitting the data into a training and a test set:
```python
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.30,
    random_state=42,
)
```
- Let’s now make a bar plot with the proportion of missing data per variable in the training and test sets:
```python
fig, axes = plt.subplots(
    2, 1, figsize=(15, 10), squeeze=False)
X_train.isnull().mean().plot(
    kind='bar', color='grey', ax=axes[0, 0], title="train")
X_test.isnull().mean().plot(
    kind='bar', color='black', ax=axes[1, 0], title="test")
axes[0, 0].set_ylabel('Fraction of NAN')
axes[1, 0].set_ylabel('Fraction of NAN')
plt.show()
```
The previous code block returns the following bar plots with the fraction of missing data per variable in the training (top) and test sets (bottom):
Figure 1.2 – Proportion of missing data per variable
- Now, we’ll remove observations if they have missing values in any variable:
```python
train_cca = X_train.dropna()
test_cca = X_test.dropna()
```
Note
pandas’ `dropna()` drops observations with any missing value by default. We can remove observations with missing data in a subset of variables like this: `data.dropna(subset=["A3", "A4"])`.
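To see how the `subset` argument changes the result (a toy example with hypothetical column names), compare dropping on all columns versus a subset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A3": [1.0, np.nan, 3.0],
    "A4": [10.0, 20.0, np.nan],
    "A5": [np.nan, 200.0, 300.0],
})

# Any missing value in any column removes the row
dropped_any = df.dropna()

# Only missing values in A3 or A4 remove the row
dropped_subset = df.dropna(subset=["A3", "A4"])

print(len(dropped_any))     # 0
print(len(dropped_subset))  # 1
```

The first row survives the subset-based deletion because its only missing value is in `A5`, which is not part of the subset.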
- Let’s print and compare the size of the original and complete case datasets:
```python
print(f"Total observations: {len(X_train)}")
print(f"Observations without NAN: {len(train_cca)}")
```
We removed more than 200 observations with missing data from the training set, as shown in the following output:
```
Total observations: 483
Observations without NAN: 264
```
- After removing observations from the training and test sets, we need to align the target variables:
```python
y_train_cca = y_train.loc[train_cca.index]
y_test_cca = y_test.loc[test_cca.index]
```
Now, the datasets and target variables contain the rows without missing data.
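The re-alignment relies on pandas label-based indexing: `.loc` selects the target rows whose index labels survived the row removal. A minimal sketch with toy data (hypothetical index labels and values):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"A1": [1.0, np.nan, 3.0, 4.0]},
                 index=[10, 11, 12, 13])
y = pd.Series([0, 1, 0, 1], index=[10, 11, 12, 13])

X_cca = X.dropna()
y_cca = y.loc[X_cca.index]  # keep only targets for the remaining rows

print(list(X_cca.index))  # [10, 12, 13]
print(y_cca.tolist())     # [0, 0, 1]
```

Because `dropna()` preserves the original index labels, `.loc` can match the target to the remaining rows without any positional bookkeeping.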
- To drop observations with missing data using `feature-engine`, let’s import the required transformer:

```python
from feature_engine.imputation import DropMissingData
```
- Let’s set up the imputer to automatically find the variables with missing data:
```python
cca = DropMissingData(variables=None, missing_only=True)
```
- Let’s fit the transformer so that it finds the variables with missing data:
```python
cca.fit(X_train)
```
- Let’s inspect the variables with NAN that the transformer found:
```python
cca.variables_
```
The previous command returns the names of the variables with missing data:
```
['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A14']
```
- Let’s remove the rows with missing data in the training and test sets:
```python
train_cca = cca.transform(X_train)
test_cca = cca.transform(X_test)
```
Use `train_cca.isnull().sum()` to corroborate the absence of missing data in the complete case dataset.

`DropMissingData` can automatically adjust the target after removing missing data from the training set:

```python
train_c, y_train_c = cca.transform_x_y(
    X_train, y_train)
test_c, y_test_c = cca.transform_x_y(X_test, y_test)
```
The previous code removed rows with `nan` from the training and test sets and then re-aligned the target variables.
Note
To remove observations with missing data in a subset of variables, use `DropMissingData(variables=['A3', 'A4'])`. To remove rows with `nan` in at least 5% of the variables, use `DropMissingData(threshold=0.95)`.
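A similar row-wise rule can be reproduced with plain pandas through the `thresh` argument of `dropna()`, which keeps rows containing at least a given number of non-missing values (a sketch on made-up data; `thresh` counts values rather than fractions, so the 95% requirement is converted into a column count):

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 20)))
df.iloc[0, 0] = np.nan   # 1 of 20 values missing (5%)
df.iloc[1, :3] = np.nan  # 3 of 20 values missing (15%)

# Keep rows with at least 95% of their values present
kept = df.dropna(thresh=math.ceil(0.95 * df.shape[1]))

print(len(kept))  # 2
```

Only the row missing 15% of its values is removed; rows at or above the 95%-complete mark are retained.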
How it works...
In this recipe, we plotted the proportion of missing data in each variable and then removed all observations with missing values.
We used pandas’ `isnull()` and `mean()` methods to determine the proportion of missing observations in each variable. The `isnull()` method created a Boolean vector per variable, with `True` and `False` values indicating whether a value was missing. The `mean()` method took the average of these values and returned the proportion of missing data.
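The mechanics can be verified on a single toy column: averaging the Boolean mask gives the fraction of `True` values, which is exactly the fraction of missing entries:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

mask = s.isnull()

print(mask.tolist())  # [False, True, False, True]
print(mask.mean())    # 0.5
```

This works because pandas treats `True` as 1 and `False` as 0 when computing the mean.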
We used pandas’ `plot.bar()` to create a bar plot of the fraction of missing data per variable. In Figure 1.2, we saw the fraction of `nan` per variable in the training and test sets.
To remove observations with missing values in any variable, we used pandas’ `dropna()`, thereby obtaining a complete case dataset.
Finally, we removed missing data using Feature-engine’s `DropMissingData()`. This imputer automatically identified and stored the variables with missing data from the train set when we called the `fit()` method. With the `transform()` method, the imputer removed observations with `nan` in those variables. With `transform_x_y()`, the imputer removed rows with `nan` from the datasets and then realigned the target variables.
See also
If you want to use `DropMissingData()` within a pipeline together with other Feature-engine or scikit-learn transformers, check out Feature-engine’s `Pipeline`: https://Feature-engine.trainindata.com/en/latest/user_guide/pipeline/Pipeline.html. This pipeline can align the target with the training and test sets after removing rows.