Python Feature Engineering Cookbook

You're reading from   Python Feature Engineering Cookbook A complete guide to crafting powerful features for your machine learning models

Product type Paperback
Published in Aug 2024
Publisher Packt
ISBN-13 9781835883587
Length 396 pages
Edition 3rd Edition
Author (1):
Soledad Galli

Replacing missing values with an arbitrary number

We can replace missing data with an arbitrary value. Commonly used values are 999, 9999, or -1 for positive distributions. This method is used for numerical variables. For categorical variables, the equivalent method is to replace missing data with an arbitrary string, as described in the Imputing categorical variables recipe.

When replacing missing values with arbitrary numbers, we need to be careful not to select a value close to the mean, the median, or any other common value of the distribution.

Note

We’d use arbitrary number imputation when data is not missing at random, when we use non-linear models, or when the percentage of missing data is high. Note that this imputation technique distorts the original variable distribution.
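To make the distortion concrete, here is a minimal sketch on a toy series. The values are invented for illustration and are not taken from the credit approval dataset. It compares the mean and standard deviation before and after filling the missing values with 99.

```python
import numpy as np
import pandas as pd

# Toy variable with two missing values (illustrative data only).
s = pd.Series([20.0, 30.0, 40.0, np.nan, np.nan])

# pandas skips NaN by default when computing statistics.
mean_before = s.mean()
std_before = s.std()

# Replace missing values with an arbitrary number far from the observed range.
s_imputed = s.fillna(99)
mean_after = s_imputed.mean()
std_after = s_imputed.std()

print(f"mean: {mean_before:.1f} -> {mean_after:.1f}")
print(f"std:  {std_before:.1f} -> {std_after:.1f}")
```

Both statistics increase after imputation, and the shift grows with the share of missing values, which is why this technique tends to suit models that do not assume a particular variable distribution.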

In this recipe, we will impute missing data with arbitrary numbers using pandas, scikit-learn, and feature-engine.

How to do it...

Let’s begin by importing the necessary tools and loading the data:

  1. Import pandas and the required functions and classes:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from feature_engine.imputation import ArbitraryNumberImputer
  2. Let’s load the dataset described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s separate the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )

    We will select arbitrary values greater than the maximum value of the distribution.

  4. Let’s find the maximum value of four numerical variables:
    X_train[['A2','A3', 'A8', 'A11']].max()

    The previous command returns the following output:

    A2     76.750
    A3     26.335
    A8     28.500
    A11    67.000
    dtype: float64

    We’ll use 99 for the imputation because it is bigger than the maximum values of the numerical variables in step 4.

  5. Let’s make a copy of the original DataFrames:
    X_train_t = X_train.copy()
    X_test_t = X_test.copy()
  6. Now, we replace the missing values with 99:
    X_train_t[["A2", "A3", "A8", "A11"]] = X_train_t[[
        "A2", "A3", "A8", "A11"]].fillna(99)
    X_test_t[["A2", "A3", "A8", "A11"]] = X_test_t[[
        "A2", "A3", "A8", "A11"]].fillna(99)

Note

To impute different variables with different values using pandas fillna(), use a dictionary like this: imputation_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}.
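As a minimal sketch of the dictionary approach, the following example applies that dictionary to a toy DataFrame. The DataFrame reuses the variable names from this recipe, but its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame that reuses the variable names from this recipe (illustrative values).
df = pd.DataFrame({
    "A2": [30.5, np.nan, 22.0],
    "A3": [np.nan, 1.5, 4.0],
    "A8": [0.5, 2.0, np.nan],
    "A11": [np.nan, 1.0, 6.0],
})

# One arbitrary value per variable.
imputation_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}

# fillna() matches the dictionary keys to the column names.
df_t = df.fillna(imputation_dict)
print(df_t)
```

Columns that do not appear in the dictionary are left untouched, so the same call can be applied to the full train and test sets.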

Now, we’ll impute missing values with an arbitrary number using scikit-learn.

  7. Let’s set up imputer to replace missing values with 99:
    imputer = SimpleImputer(strategy='constant', fill_value=99)

Note

If your dataset contains categorical variables, SimpleImputer() will also replace any missing values in those variables with 99.

  8. Let’s fit imputer to a slice of the train set containing the variables to impute:
    vars = ["A2", "A3", "A8", "A11"]
    imputer.fit(X_train[vars])
  9. Replace the missing values with 99 in the desired variables:
    X_train_t[vars] = imputer.transform(X_train[vars])
    X_test_t[vars] = imputer.transform(X_test[vars])

    Go ahead and confirm that no missing values remain by executing X_test_t[["A2", "A3", "A8", "A11"]].isnull().sum().

    To finish, let’s impute missing values using feature-engine.

  10. Let’s set up the imputer to replace missing values with 99 in four specific variables:
    imputer = ArbitraryNumberImputer(
        arbitrary_number=99,
        variables=["A2", "A3", "A8", "A11"],
    )

Note

ArbitraryNumberImputer() will automatically select all numerical variables in the train set for imputation if we set the variables parameter to None.

  11. Finally, let’s replace the missing values with 99:
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)

Note

To impute different variables with different numbers, set up ArbitraryNumberImputer() as follows: ArbitraryNumberImputer(imputer_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}).

We have now replaced missing data with arbitrary numbers using three different open-source libraries.

How it works...

In this recipe, we replaced missing values in numerical variables with an arbitrary number using pandas, scikit-learn, and feature-engine.

To determine which arbitrary value to use, we inspected the maximum values of four numerical variables using pandas’ max(). We chose 99 because it was greater than the maximum values of the selected variables. In step 6, we used pandas fillna() to replace the missing data.
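The check behind this choice can be sketched as follows. The toy train set and the names candidate and is_outside_range are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy train set (illustrative values, not the real dataset).
X_train = pd.DataFrame({
    "A2": [22.1, 76.75, 35.0],
    "A3": [0.5, 26.335, 3.0],
})

candidate = 99  # arbitrary number under consideration

# Largest observed value per variable; max() ignores NaN by default.
max_values = X_train.max()

# The candidate is usable only if it exceeds every observed maximum.
is_outside_range = bool((candidate > max_values).all())
print(max_values)
print("candidate is above all maxima:", is_outside_range)
```

Running the check on the train set only keeps information from the test set out of the decision.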

To replace missing values using scikit-learn, we utilized SimpleImputer(), with the strategy set to constant, and specified 99 in the fill_value argument. Next, we fitted the imputer to a slice of the train set with the numerical variables to impute. Finally, we replaced missing values using transform().
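As a minimal sketch of this workflow, the example below fits SimpleImputer() on a toy train set, transforms a toy test set, and compares the result with pandas fillna(). The data is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train and test sets (illustrative values only).
train = pd.DataFrame({"A2": [30.0, np.nan, 22.0], "A3": [np.nan, 1.5, 4.0]})
test = pd.DataFrame({"A2": [np.nan, 41.0], "A3": [2.0, np.nan]})

# With strategy="constant", the fill value does not depend on the data,
# but fit() is still required before calling transform().
imputer = SimpleImputer(strategy="constant", fill_value=99)
imputer.fit(train)

# transform() returns a NumPy array by default, so we rebuild the DataFrame.
test_sklearn = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

# The pandas equivalent of the same imputation.
test_pandas = test.fillna(99)

print(np.allclose(test_sklearn.values, test_pandas.values))
```

Assigning the transformed array back to named columns, as the recipe does with X_train_t[vars], serves the same purpose as rebuilding the DataFrame here.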

To replace missing values with feature-engine, we used ArbitraryNumberImputer(), specifying the value 99 and the variables to impute as parameters. Next, we applied the fit_transform() method to replace missing data in the train set and the transform() method to replace missing data in the test set.
