Python Feature Engineering Cookbook

You're reading from Python Feature Engineering Cookbook: A complete guide to crafting powerful features for your machine learning models

Product type Paperback
Published in Aug 2024
Publisher Packt
ISBN-13 9781835883587
Length 396 pages
Edition 3rd Edition
Author: Soledad Galli

Removing observations with missing data

Complete Case Analysis (CCA), also called list-wise deletion of cases, consists of discarding observations with missing data. CCA can be applied to both categorical and numerical variables. With CCA, the distribution of the variables is preserved after removing the incomplete observations, provided the data is missing at random and only in a small proportion of observations. However, if data is missing across many variables, CCA may lead to the removal of a large portion of the dataset.
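
To see how missing values spread across several variables can shrink a dataset quickly, consider a minimal sketch on a made-up DataFrame (the column names and values are hypothetical and do not come from the credit approval data). Each variable lacks a single value, but in a different row each time:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each column has one missing value,
# located in a different row.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, 58.0, 47.0],
    "income": [3200.0, 4100.0, np.nan, 2800.0, 5100.0, 3900.0],
    "city": ["Paris", "Rome", "Lima", None, "Oslo", "Cairo"],
})

# Complete case analysis: keep only rows with no missing values.
toy_cca = toy.dropna()

# Fraction of rows discarded by CCA.
rows_removed = 1 - len(toy_cca) / len(toy)

print(f"Rows before CCA: {len(toy)}")
print(f"Rows after CCA: {len(toy_cca)}")
print(f"Fraction of rows removed: {rows_removed:.0%}")
```

In this toy example, every variable is missing only one value out of six, yet CCA discards half of the rows.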

Note

Use CCA only when a small number of observations have missing data and you have good reasons to believe that they are not important to your model.

How to do it...

Let’s begin by making some imports and loading the dataset:

  1. Let’s import pandas, matplotlib, and the train/test split function from scikit-learn:
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load and display the dataset described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
    data.head()

    In the following image, we see the first 5 rows of data:

Figure 1.1 – First 5 rows of the dataset


  3. Let’s proceed as we normally would when preparing data to train machine learning models, by splitting the data into a training and a test set:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.30,
        random_state=42,
    )
  4. Let’s now make a bar plot with the proportion of missing data per variable in the training and test sets:
    fig, axes = plt.subplots(
        2, 1, figsize=(15, 10), squeeze=False)
    X_train.isnull().mean().plot(
        kind='bar', color='grey', ax=axes[0, 0], title="train")
    X_test.isnull().mean().plot(
        kind='bar', color='black', ax=axes[1, 0], title="test")
    axes[0, 0].set_ylabel('Fraction of NAN')
    axes[1, 0].set_ylabel('Fraction of NAN')
    plt.show()

    The previous code block returns the following bar plots with the fraction of missing data per variable in the training (top) and test sets (bottom):

Figure 1.2 – Proportion of missing data per variable


  5. Now, we’ll remove observations if they have missing values in any variable:
    train_cca = X_train.dropna()
    test_cca = X_test.dropna()

Note

pandas’ dropna() drops observations with any missing value by default. We can remove observations with missing data in a subset of variables like this: data.dropna(subset=["A3", "A4"]).

  6. Let’s print and compare the size of the original and complete case datasets:
    print(f"Total observations: {len(X_train)}")
    print(f"Observations without NAN: {len(train_cca)}")

    We removed more than 200 observations with missing data from the training set, as shown in the following output:

    Total observations: 483
    Observations without NAN: 264
  7. After removing observations from the training and test sets, we need to align the target variables:
    y_train_cca = y_train.loc[train_cca.index]
    y_test_cca = y_test.loc[test_cca.index]

    Now, the datasets and target variables contain the rows without missing data.

  8. To drop observations with missing data using Feature-engine, let’s import the required transformer:
    from feature_engine.imputation import DropMissingData
  9. Let’s set up the imputer to automatically find the variables with missing data:
    cca = DropMissingData(variables=None, missing_only=True)
  10. Let’s fit the transformer so that it finds the variables with missing data:
    cca.fit(X_train)
  11. Let’s inspect the variables with NAN that the transformer found:
    cca.variables_

    The previous command returns the names of the variables with missing data:

    ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A14']
  12. Let’s remove the rows with missing data in the training and test sets:
    train_cca = cca.transform(X_train)
    test_cca = cca.transform(X_test)

    Use train_cca.isnull().sum() to corroborate the absence of missing data in the complete case dataset.

  13. DropMissingData can automatically adjust the target after removing rows with missing data from the training and test sets:
    train_c, y_train_c = cca.transform_x_y(X_train, y_train)
    test_c, y_test_c = cca.transform_x_y(X_test, y_test)

The previous code removed rows with nan from the training and test sets and then re-aligned the target variables.
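
The re-alignment works because dropping rows preserves the index labels of the rows that remain. The following is a minimal sketch of that idea using pandas only and made-up data; it illustrates the principle and is not Feature-engine's internal code:

```python
import numpy as np
import pandas as pd

# Hypothetical predictors and target sharing the same index.
X = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0, 4.0], "B": ["x", "y", None, "z"]},
    index=[10, 11, 12, 13],
)
y = pd.Series([0, 1, 1, 0], index=X.index, name="target")

# Drop incomplete rows; the surviving rows keep their index labels.
X_cca = X.dropna()

# Select the target values whose index labels are still present.
y_cca = y.loc[X_cca.index]

print(X_cca.index.tolist())
print(y_cca.tolist())
```

After this step, X_cca and y_cca contain the same rows in the same order, so they can be passed together to a model.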

Note

To remove observations with missing data in a subset of variables, use DropMissingData(variables=['A3', 'A4']). To remove rows with nan in more than 5% of the variables, use DropMissingData(threshold=0.95), which keeps only the rows that have values in at least 95% of the variables.
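
For comparison, pandas' dropna() offers similar controls through its subset and thresh parameters. The sketch below uses a made-up DataFrame. Note that thresh takes an absolute count of non-missing values per row rather than a fraction, so the fraction is converted by hand here; that conversion is an assumption of this example, not a statement about how DropMissingData is implemented:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical data with four variables.
df = pd.DataFrame({
    "A1": [1.0, np.nan, 3.0, 4.0],
    "A2": [np.nan, np.nan, 1.5, 2.5],
    "A3": [0.1, np.nan, 0.3, 0.4],
    "A4": ["a", "b", None, "d"],
})

# Remove rows with missing data in a subset of variables only.
subset_cca = df.dropna(subset=["A3", "A4"])

# Keep rows in which at least 75% of the variables have a value.
min_non_missing = math.ceil(0.75 * df.shape[1])
thresh_cca = df.dropna(thresh=min_non_missing)

print(subset_cca.index.tolist())
print(thresh_cca.index.tolist())
```

The first call ignores missing values in A1 and A2, whereas the second tolerates one missing value per row regardless of the variable.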

How it works...

In this recipe, we plotted the proportion of missing data in each variable and then removed all observations with missing values.

We used pandas’ isnull() and mean() methods to determine the proportion of missing observations in each variable. The isnull() method created a Boolean vector per variable, with True marking the missing values and False marking the values that were present. Because True counts as 1 and False as 0, the mean() method returned the proportion of missing data in each variable.
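
As a small illustration of this step, the sketch below applies isnull() and mean() to a made-up DataFrame (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: "A" has 1 of 4 values missing, "B" has 2 of 4.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, "x", None, "y"],
})

# Boolean mask: True where a value is missing.
mask = df.isnull()

# Averaging the Booleans gives the fraction of missing values.
fraction_missing = mask.mean()

print(mask)
print(fraction_missing)
```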

We used pandas’ plot() method with kind='bar' to create a bar plot of the fraction of missing data per variable. In Figure 1.2, we saw the fraction of nan per variable in the training and test sets.

To remove observations with missing values in any variable, we used pandas’ dropna(), thereby obtaining a complete case dataset.

Finally, we removed missing data using Feature-engine’s DropMissingData(). This imputer automatically identified and stored the variables with missing data in the train set when we called the fit() method. With the transform() method, the imputer removed observations with nan in those variables. With transform_x_y(), the imputer removed rows with nan from the datasets and then realigned the target variable.
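
The fit/transform logic described above can be sketched with pandas alone. The class below, SimpleDropMissing, is a hypothetical stand-in written for illustration; it is not Feature-engine's implementation and only mimics the behavior described in this recipe:

```python
import numpy as np
import pandas as pd


class SimpleDropMissing:
    """Hypothetical, simplified stand-in for a drop-missing transformer."""

    def fit(self, X):
        # Learn which variables contain missing data in the train set.
        self.variables_ = X.columns[X.isnull().any()].tolist()
        return self

    def transform(self, X):
        # Drop rows with missing data in the learned variables only.
        return X.dropna(subset=self.variables_)

    def transform_x_y(self, X, y):
        # Drop rows from X, then realign y through the index.
        X_t = self.transform(X)
        return X_t, y.loc[X_t.index]


# Made-up train and test sets for the demonstration.
X_train = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [1.0, 2.0, 3.0]})
y_train = pd.Series([0, 1, 0])
X_test = pd.DataFrame({"A": [np.nan, 5.0, 6.0], "B": [1.0, np.nan, 3.0]})

cca = SimpleDropMissing().fit(X_train)
train_c, y_train_c = cca.transform_x_y(X_train, y_train)
test_c = cca.transform(X_test)

print(cca.variables_)
print(train_c.index.tolist(), y_train_c.tolist())
print(test_c.index.tolist())
```

In this sketch, the test row with a missing value in B is kept, because B had no missing data in the train set when fit() was called and was therefore not stored in variables_.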

See also

If you want to use DropMissingData() within a pipeline together with other Feature-engine or scikit-learn transformers, check out Feature-engine’s Pipeline: https://Feature-engine.trainindata.com/en/latest/user_guide/pipeline/Pipeline.html. This pipeline can align the target with the training and test sets after removing rows.
