
Python Feature Engineering Cookbook

Imputing Missing Data

Missing data—meaning the absence of values for certain observations—is an unavoidable problem in most data sources. Some machine learning model implementations can handle missing data out of the box. To train other models, we must remove observations with missing data or transform them into permitted values.

The act of replacing missing data with their statistical estimates is called imputation. The goal of any imputation technique is to produce a complete dataset. There are multiple imputation methods. We select which one to use based on whether the data is missing at random, the proportion of missing values, and the machine learning model we intend to use. In this chapter, we will discuss several imputation methods.

This chapter will cover the following recipes:

  • Removing observations with missing data
  • Performing mean or median imputation
  • Imputing categorical variables
  • Replacing missing values with an arbitrary number
  • Finding extreme values for imputation
  • Marking imputed values
  • Implementing forward and backward fill
  • Carrying out interpolation
  • Performing multivariate imputation by chained equations
  • Estimating missing data with nearest neighbors

Technical requirements

In this chapter, we will use the Python libraries Matplotlib, pandas, NumPy, scikit-learn, and Feature-engine. If you need to install Python, the free Anaconda Python distribution (https://www.anaconda.com/) includes most numerical computing libraries.

feature-engine can be installed with pip as follows:

pip install feature-engine

If you use Anaconda, you can install feature-engine with conda:

conda install -c conda-forge feature_engine

Note

The recipes from this chapter were created using the latest versions of the Python libraries at the time of publishing. You can check the versions in the requirements.txt file in the accompanying GitHub repository, at https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/requirements.txt.

We will use the Credit Approval dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/), licensed under the Creative Commons Attribution 4.0 (CC BY 4.0) license: https://creativecommons.org/licenses/by/4.0/legalcode. You’ll find the dataset at this link: http://archive.ics.uci.edu/dataset/27/credit+approval.

I downloaded and modified the data as shown in this notebook: https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/credit-approval-dataset.ipynb

We will also use the air passenger dataset located in Facebook’s Prophet GitHub repository (https://github.com/facebook/prophet/blob/main/examples/example_air_passengers.csv), licensed under the MIT license: https://github.com/facebook/prophet/blob/main/LICENSE

I modified the data as shown in this notebook: https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/air-passengers-dataset.ipynb

You’ll find a copy of the modified data sets in the accompanying GitHub repository: https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/

Removing observations with missing data

Complete Case Analysis (CCA), also called list-wise deletion of cases, consists of discarding observations with missing data. CCA can be applied to both categorical and numerical variables. CCA preserves the distribution of the variables, provided the data is missing at random and only in a small proportion of observations. However, if data is missing across many variables, CCA may lead to the removal of a large portion of the dataset.
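Before committing to CCA, it is worth checking how much of the dataset would survive the deletion. Here is a minimal sketch, using the dataset from the Technical requirements section:

    import pandas as pd

    data = pd.read_csv("credit_approval_uci.csv")

    # Fraction of rows that remain after dropping any row with missing values
    complete_fraction = len(data.dropna()) / len(data)
    print(f"Complete cases: {complete_fraction:.1%}")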

Note

Use CCA only when a small number of observations are missing and you have good reasons to believe that they are not important to your model.

How to do it...

Let’s begin by making some imports and loading the dataset:

  1. Let’s import pandas, matplotlib, and the train/test split function from scikit-learn:
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load and display the dataset described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
    data.head()

    In the following image, we see the first 5 rows of data:

Figure 1.1 – First 5 rows of the dataset

  3. Let’s proceed as we normally would if we were preparing the data to train machine learning models, by splitting the data into a training and a test set:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.30,
        random_state=42,
    )
  4. Let’s now make a bar plot with the proportion of missing data per variable in the training and test sets:
    fig, axes = plt.subplots(
        2, 1, figsize=(15, 10), squeeze=False)
    X_train.isnull().mean().plot(
        kind='bar', color='grey', ax=axes[0, 0], title="train")
    X_test.isnull().mean().plot(
        kind='bar', color='black', ax=axes[1, 0], title="test")
    axes[0, 0].set_ylabel('Fraction of NAN')
    axes[1, 0].set_ylabel('Fraction of NAN')
    plt.show()

    The previous code block returns the following bar plots with the fraction of missing data per variable in the training (top) and test sets (bottom):

Figure 1.2 – Proportion of missing data per variable

  5. Now, we’ll remove observations if they have missing values in any variable:
    train_cca = X_train.dropna()
    test_cca = X_test.dropna()

Note

pandas’ dropna() drops observations with any missing value by default. We can remove observations with missing data in a subset of variables like this: data.dropna(subset=["A3", "A4"]).

  6. Let’s print and compare the size of the original and complete case datasets:
    print(f"Total observations: {len(X_train)}")
    print(f"Observations without NAN: {len(train_cca)}")

    We removed more than 200 observations with missing data from the training set, as shown in the following output:

    Total observations: 483
    Observations without NAN: 264
  7. After removing observations from the training and test sets, we need to align the target variables:
    y_train_cca = y_train.loc[train_cca.index]
    y_test_cca = y_test.loc[test_cca.index]

    Now, the datasets and target variables contain the rows without missing data.

  8. To drop observations with missing data utilizing feature-engine, let’s import the required transformer:
    from feature_engine.imputation import DropMissingData
  9. Let’s set up the imputer to automatically find the variables with missing data:
    cca = DropMissingData(variables=None, missing_only=True)
  10. Let’s fit the transformer so that it finds the variables with missing data:
    cca.fit(X_train)
  11. Let’s inspect the variables with NAN that the transformer found:
    cca.variables_

    The previous command returns the names of the variables with missing data:

    ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A14']
  12. Let’s remove the rows with missing data in the training and test sets:
    train_cca = cca.transform(X_train)
    test_cca = cca.transform(X_test)

    Use train_cca.isnull().sum() to corroborate the absence of missing data in the complete case dataset.

  13. DropMissingData can automatically adjust the target after removing missing data from the training set:
    train_c, y_train_c = cca.transform_x_y(X_train, y_train)
    test_c, y_test_c = cca.transform_x_y(X_test, y_test)

The previous code removed rows with nan from the training and test sets and then re-aligned the target variables.

Note

To remove observations with missing data in a subset of variables, use DropMissingData(variables=['A3', 'A4']). To keep only rows with values in at least 95% of the variables, use DropMissingData(threshold=0.95).

How it works...

In this recipe, we plotted the proportion of missing data in each variable and then removed all observations with missing values.

We used the pandas isnull() and mean() methods to determine the proportion of missing observations in each variable. The isnull() method created a Boolean vector per variable, with True and False values indicating whether a value was missing. The mean() method took the average of these values and returned the proportion of missing data.
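As a quick toy illustration of this chaining (made-up values, not part of the recipe):

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, 3.0, np.nan])
    print(s.isnull())         # Boolean mask: True where the value is missing
    print(s.isnull().mean())  # 0.5, the proportion of missing values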

We used pandas’ plot(kind='bar') to create a bar plot of the fraction of missing data per variable. In Figure 1.2, we saw the fraction of nan per variable in the training and test sets.

To remove observations with missing values in any variable, we used pandas’ dropna(), thereby obtaining a complete case dataset.

Finally, we removed missing data using Feature-engine’s DropMissingData(). This imputer automatically identified and stored the variables with missing data from the train set when we called the fit() method. With the transform() method, the imputer removed observations with nan in those variables. With transform_x_y(), the imputer removed rows with nan from the data sets and then realigned the target variable.

See also

If you want to use DropMissingData() within a pipeline together with other Feature-engine or scikit-learn transformers, check out Feature-engine’s Pipeline: https://Feature-engine.trainindata.com/en/latest/user_guide/pipeline/Pipeline.html. This pipeline can align the target with the training and test sets after removing rows.
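Here is a minimal sketch of such a pipeline, assuming a Feature-engine version that ships feature_engine.pipeline.Pipeline (1.8.0 or later at the time of writing), with made-up toy data:

    import numpy as np
    import pandas as pd
    from feature_engine.imputation import DropMissingData
    from feature_engine.pipeline import Pipeline  # note: not sklearn.pipeline
    from sklearn.linear_model import LogisticRegression

    X = pd.DataFrame({
        "x1": [1.0, np.nan, 3.0, 4.0],
        "x2": [0.1, 0.2, np.nan, 0.4],
    })
    y = pd.Series([0, 1, 0, 1])

    pipe = Pipeline([
        ("drop_na", DropMissingData()),
        ("clf", LogisticRegression()),
    ])

    # During fit, rows with nan are dropped and y is realigned
    # before the data reaches the estimator
    pipe.fit(X, y)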

Performing mean or median imputation

Mean or median imputation consists of replacing missing data with the variable’s mean or median value. To avoid data leakage, we determine the mean or median using the train set, and then use these values to impute the train and test sets, and all future data.

Scikit-learn and Feature-engine learn the mean or median from the train set and store these parameters for future use out of the box.

In this recipe, we will perform mean and median imputation using pandas, scikit-learn, and feature-engine.

Note

Use mean imputation if variables are normally distributed and median imputation otherwise. Mean and median imputation may distort the variable distribution if there is a high percentage of missing data.
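One rough, hand-rolled way to guide this choice (it is not part of the recipe below) is to inspect the skewness of the variables; values near zero suggest a roughly symmetric distribution:

    import pandas as pd

    data = pd.read_csv("credit_approval_uci.csv")
    numeric_vars = data.select_dtypes(exclude="O").columns

    # Strongly skewed variables are usually better served by the median
    print(data[numeric_vars].skew())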

How to do it...

Let’s begin this recipe:

  1. First, we’ll import pandas and the required functions and classes from scikit-learn and feature-engine:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from feature_engine.imputation import MeanMedianImputer
  2. Let’s load the dataset that we prepared in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s split the data into train and test sets with their respective targets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s make a list with the numerical variables by excluding variables of type object:
    numeric_vars = X_train.select_dtypes(
        exclude="O").columns.to_list()

    If you execute numeric_vars, you will see the names of the numerical variables: ['A2', 'A3', 'A8', 'A11', 'A14', 'A15'].

  5. Let’s capture the variables’ median values in a dictionary:
    median_values = X_train[
        numeric_vars].median().to_dict()

Tip

Note how we calculate the median using the train set. We will use these values to replace missing data in the train and test sets. To calculate the mean, use pandas mean() instead of median().

If you execute median_values, you will see a dictionary with the median value per variable: {'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A14': 160.0, 'A15': 6.0}.

  6. Let’s replace missing data with the median:
    X_train_t = X_train.fillna(value=median_values)
    X_test_t = X_test.fillna(value=median_values)

    If you execute X_train_t[numeric_vars].isnull().sum() after the imputation, the number of missing values in the numerical variables should be 0.

Note

pandas fillna() returns a new dataset with imputed values by default. To replace missing data in the original DataFrame, set the inplace parameter to True: X_train.fillna(value=median_values, inplace=True).

Now, let’s impute missing values with the median using scikit-learn.

  7. Let’s set up the imputer to replace missing data with the median:
    imputer = SimpleImputer(strategy="median")

Note

To perform mean imputation, set SimpleImputer() as follows: imputer = SimpleImputer(strategy="mean").

  8. We restrict the imputation to the numerical variables by using ColumnTransformer():
    ct = ColumnTransformer(
        [("imputer", imputer, numeric_vars)],
        remainder="passthrough",
        force_int_remainder_cols=False,
    ).set_output(transform="pandas")

Note

Scikit-learn can return numpy arrays, pandas DataFrames, or polars DataFrames, depending on how we set the transform output. By default, it returns numpy arrays.

  9. Let’s fit the imputer to the train set so that it learns the median values:
    ct.fit(X_train)
  10. Let’s check out the learned median values:
    ct.named_transformers_.imputer.statistics_

    The previous command returns the median values per variable:

    array([ 28.835,   2.75,   1.,   0., 160.,   6.])
  11. Let’s replace missing values with the median:
    X_train_t = ct.transform(X_train)
    X_test_t = ct.transform(X_test)
  12. Let’s display the resulting training set:
    print(X_train_t.head())

    We see the resulting DataFrame in the following image:

Figure 1.3 – Training set after the imputation. The imputed variables are marked by the imputer prefix; the untransformed variables show the prefix remainder

Finally, let’s perform median imputation using feature-engine.

  13. Let’s set up the imputer to replace missing data in numerical variables with the median:
    imputer = MeanMedianImputer(
        imputation_method="median",
        variables=numeric_vars,
    )

Note

To perform mean imputation, change imputation_method to "mean". By default, MeanMedianImputer() will impute all numerical variables in the DataFrame, ignoring categorical variables. Use the variables argument to restrict the imputation to a subset of numerical variables.

  14. Fit the imputer so that it learns the median values:
    imputer.fit(X_train)
  15. Inspect the learned medians:
    imputer.imputer_dict_

    The previous command returns the median values in a dictionary:

    {'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A14': 160.0, 'A15': 6.0}
  16. Finally, let’s replace the missing values with the median:
    X_train = imputer.transform(X_train)
    X_test = imputer.transform(X_test)

Feature-engine’s MeanMedianImputer() returns a DataFrame. You can check that the imputed variables do not contain missing values using X_train[numeric_vars].isnull().mean().

How it works...

In this recipe, we replaced missing data with the variable’s median values using pandas, scikit-learn, and feature-engine.

We divided the dataset into train and test sets using scikit-learn’s train_test_split() function. The function takes the predictor variables, the target, the fraction of observations to retain in the test set, and a random_state value for reproducibility as arguments. It returned a train set with 70% of the original observations and a test set with 30% of the original observations. The 70:30 split was done at random.

To impute missing data with pandas, in step 5, we created a dictionary with the numerical variable names as keys and their medians as values. The median values were learned from the training set to avoid data leakage. To replace missing data, we applied pandas’ fillna() to the train and test sets, passing the dictionary with the median values per variable as a parameter.

To replace the missing values with the median using scikit-learn, we used SimpleImputer() with the strategy set to "median". To restrict the imputation to numerical variables, we used ColumnTransformer(). With the remainder argument set to passthrough, we made ColumnTransformer() return all the variables seen in the training set in the transformed output: the imputed ones, followed by those that were not transformed.

Note

ColumnTransformer() changes the names of the variables in the output. The transformed variables show the prefix imputer and the unchanged variables show the prefix remainder.

In step 8, we set the output of the column transformer to pandas to obtain a DataFrame as a result. By default, ColumnTransformer() returns numpy arrays.

Note

From version 1.4.0, scikit-learn transformers can return numpy arrays, pandas DataFrames, or polars DataFrames as a result of the transform() method.
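As a toy illustration of this behavior (pandas output requires scikit-learn 1.2 or later; polars requires 1.4 or later):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"A2": [1.0, np.nan, 3.0]})
    imputer = SimpleImputer(strategy="median").set_output(transform="pandas")

    # Returns a pandas DataFrame instead of a numpy array
    print(imputer.fit_transform(df))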

With fit(), SimpleImputer() learned the median of each numerical variable in the train set and stored them in its statistics_ attribute. With transform(), it replaced the missing values with the medians.

To replace missing values with the median using Feature-engine, we used the MeanMedianImputer() with the imputation_method set to median. To restrict the imputation to a subset of variables, we passed the variable names in a list to the variables parameter. With fit(), the transformer learned and stored the median values per variable in a dictionary in its imputer_dict_ attribute. With transform(), it replaced the missing values, returning a pandas DataFrame.

Imputing categorical variables

We typically impute categorical variables with the most frequent category, or with a specific string. To avoid data leakage, we find the frequent categories from the train set. Then, we use these values to impute the train, test, and future datasets. scikit-learn and feature-engine find and store the frequent categories for the imputation, out of the box.

In this recipe, we will replace missing data in categorical variables with the most frequent category, or with an arbitrary string.

How to do it...

To begin, let’s make a few imports and prepare the data:

  1. Let’s import pandas and the required functions and classes from scikit-learn and feature-engine:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from feature_engine.imputation import CategoricalImputer
  2. Let’s load the dataset that we prepared in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s split the data into train and test sets and their respective targets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s capture the categorical variables in a list:
    categorical_vars = X_train.select_dtypes(
        include="O").columns.to_list()
  5. Let’s store the variables’ most frequent categories in a dictionary:
    frequent_values = X_train[
        categorical_vars].mode().iloc[0].to_dict()
  6. Let’s replace missing values with the frequent categories:
    X_train_t = X_train.fillna(value=frequent_values)
    X_test_t = X_test.fillna(value=frequent_values)

Note

fillna() returns a new DataFrame with the imputed values by default. We can replace missing data in the original DataFrame by executing X_train.fillna(value=frequent_values, inplace=True).

  7. To replace missing data with a specific string, let’s create an imputation dictionary with the categorical variable names as the keys and an arbitrary string as the values:
    imputation_dict = {
        var: "no_data" for var in categorical_vars
    }

    Now, we can use this dictionary and the code in step 6 to replace missing data.

Note

With pandas’ value_counts(), we can see the string added by the imputation. Try executing, for example, X_train_t["A1"].value_counts().

Now, let’s impute missing values with the most frequent category using scikit-learn.

  8. Let’s set up the imputer to find the most frequent category per variable:
    imputer = SimpleImputer(strategy='most_frequent')

Note

SimpleImputer() will learn the mode for numerical and categorical variables alike. But in practice, mode imputation is done for categorical variables only.

  9. Let’s restrict the imputation to the categorical variables:
    ct = ColumnTransformer(
        [("imputer", imputer, categorical_vars)],
        remainder="passthrough"
    ).set_output(transform="pandas")

Note

To impute missing data with a string instead of the most frequent category, set SimpleImputer() as follows: imputer = SimpleImputer(strategy="constant", fill_value="missing").

  10. Fit the imputer to the train set so that it learns the most frequent values:
    ct.fit(X_train)
  11. Let’s take a look at the most frequent values learned by the imputer:
    ct.named_transformers_.imputer.statistics_

    The previous command returns the most frequent values per variable:

    array(['b', 'u', 'g', 'c', 'v', 't', 'f', 'f', 'g'], dtype=object)
  12. Finally, let’s replace missing values with the frequent categories:
    X_train_t = ct.transform(X_train)
    X_test_t = ct.transform(X_test)

    Make sure to inspect the resulting DataFrames by executing X_train_t.head().

Note

The ColumnTransformer() changes the names of the variables. The imputed variables show the prefix imputer and the untransformed variables the prefix remainder.

Finally, let’s impute missing values using feature-engine.

  13. Let’s set up the imputer to replace the missing data in categorical variables with their most frequent value:
    imputer = CategoricalImputer(
        imputation_method="frequent",
        variables=categorical_vars,
    )

Note

With the variables parameter set to None, CategoricalImputer() will automatically impute all categorical variables found in the train set. Use this parameter to restrict the imputation to a subset of categorical variables, as shown in step 13.

  14. Fit the imputer to the train set so that it learns the most frequent categories:
    imputer.fit(X_train)

Note

To impute categorical variables with a specific string, set imputation_method to missing and fill_value to the desired string.

  15. Let’s check out the learned categories:
    imputer.imputer_dict_

    We can see the dictionary with the most frequent values in the following output:

    {'A1': 'b',
     'A4': 'u',
     'A5': 'g',
     'A6': 'c',
     'A7': 'v',
     'A9': 't',
     'A10': 'f',
     'A12': 'f',
     'A13': 'g'}
  16. Finally, let’s replace the missing values with frequent categories:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

    If you want to impute numerical variables with a string or the most frequent value using CategoricalImputer(), set the ignore_format parameter to True.

CategoricalImputer() returns a pandas DataFrame as a result.

How it works...

In this recipe, we replaced missing values in categorical variables with the most frequent categories or an arbitrary string. We used pandas, scikit-learn, and feature-engine.

In step 5, we created a dictionary with the variable names as keys and the frequent categories as values. To capture the frequent categories, we used pandas’ mode(), and to return a dictionary, we used pandas’ to_dict(). To replace the missing data, we used pandas’ fillna(), passing the dictionary with the variables and their frequent categories as parameters. There can be more than one mode in a variable, which is why we made sure to capture only one of those values by using .iloc[0].
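A toy example shows why .iloc[0] is needed (made-up values):

    import pandas as pd

    s = pd.Series(["a", "a", "b", "b", "c"])
    print(s.mode())          # returns 'a' and 'b'; the variable is bimodal
    print(s.mode().iloc[0])  # keeps only the first mode, 'a'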

To replace the missing values using scikit-learn, we used SimpleImputer() with the strategy set to most_frequent. To restrict the imputation to categorical variables, we used ColumnTransformer(). With remainder set to passthrough, we made ColumnTransformer() return all the variables present in the training set as a result of the transform() method.

Note

ColumnTransformer() changes the names of the variables in the output. The transformed variables show the prefix imputer and the unchanged variables show the prefix remainder.

With fit(), SimpleImputer() learned the variables’ most frequent categories and stored them in its statistics_ attribute. With transform(), it replaced the missing data with the learned parameters.

SimpleImputer() and ColumnTransformer() return NumPy arrays by default. We can change this behavior with the set_output() method.

To replace missing values with feature-engine, we used the CategoricalImputer() with imputation_method set to frequent. With fit(), the transformer learned and stored the most frequent categories in a dictionary in its imputer_dict_ attribute. With transform(), it replaced the missing values with the learned parameters.

Unlike SimpleImputer(), CategoricalImputer() will only impute categorical variables, unless we instruct it otherwise by setting the ignore_format parameter to True. In addition, with feature-engine transformers, we can restrict the transformations to a subset of variables through the transformer itself. For scikit-learn transformers, we need the additional ColumnTransformer() class to apply the transformation to a subset of the variables.

Replacing missing values with an arbitrary number

We can replace missing data with an arbitrary value. Commonly used values are 999, 9999, or -1 for positive distributions. This method is used for numerical variables. For categorical variables, the equivalent method is to replace missing data with an arbitrary string, as described in the Imputing categorical variables recipe.

When replacing missing values with arbitrary numbers, we need to be careful not to select a value close to the mean, the median, or any other common value of the distribution.
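A simple sanity check, not part of the recipe below, is to compare the candidate value against the variables’ summary statistics and pick something well outside the observed range:

    import pandas as pd

    data = pd.read_csv("credit_approval_uci.csv")

    # The arbitrary value should fall well outside the min-max range
    print(data[["A2", "A3", "A8", "A11"]].describe())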

Note

We’d use arbitrary number imputation when data is not missing at random, when we use non-linear models, or when the percentage of missing data is high. This imputation technique distorts the original variable distribution.

In this recipe, we will impute missing data with arbitrary numbers using pandas, scikit-learn, and feature-engine.

How to do it...

Let’s begin by importing the necessary tools and loading the data:

  1. Import pandas and the required functions and classes:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from feature_engine.imputation import ArbitraryNumberImputer
  2. Let’s load the dataset described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s separate the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )

    We will select arbitrary values greater than the maximum value of the distribution.

  4. Let’s find the maximum value of four numerical variables:
    X_train[["A2", "A3", "A8", "A11"]].max()

    The previous command returns the following output:

    A2     76.750
    A3     26.335
    A8     28.500
    A11    67.000
    dtype: float64

    We’ll use 99 for the imputation because it is bigger than the maximum values of the numerical variables in step 4.

  5. Let’s make a copy of the original DataFrames:
    X_train_t = X_train.copy()
    X_test_t = X_test.copy()
  6. Now, we replace the missing values with 99:
    X_train_t[["A2", "A3", "A8", "A11"]] = X_train_t[[
        "A2", "A3", "A8", "A11"]].fillna(99)
    X_test_t[["A2", "A3", "A8", "A11"]] = X_test_t[[
        "A2", "A3", "A8", "A11"]].fillna(99)

Note

To impute different variables with different values using pandas fillna(), use a dictionary like this: imputation_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}.
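On a made-up two-column frame, such a dictionary works like this:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"A2": [1.0, np.nan], "A8": [np.nan, 2.0]})
    imputation_dict = {"A2": -1, "A8": 999}

    # Each column is filled with its own value in a single call
    print(df.fillna(value=imputation_dict))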

Now, we’ll impute missing values with an arbitrary number using scikit-learn.

  7. Let’s set up the imputer to replace missing values with 99:
    imputer = SimpleImputer(strategy='constant', fill_value=99)

Note

If your dataset contains categorical variables, SimpleImputer() will also impute those variables with 99 if they have missing values.
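A quick toy check of this behavior (made-up data):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"num": [1.0, np.nan], "cat": ["a", np.nan]})
    imputer = SimpleImputer(strategy="constant", fill_value=99)

    # The categorical column is filled with 99 as well
    print(imputer.fit_transform(df))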

  8. Let’s fit the imputer to a slice of the train set containing the variables to impute:
    vars = ["A2", "A3", "A8", "A11"]
    imputer.fit(X_train[vars])
  9. Replace the missing values with 99 in the desired variables:
    X_train_t[vars] = imputer.transform(X_train[vars])
    X_test_t[vars] = imputer.transform(X_test[vars])

    Go ahead and check the lack of missing values by executing X_test_t[["A2", "A3", "A8", "A11"]].isnull().sum().

    To finish, let’s impute missing values using feature-engine.

  10. Let’s set up the imputer to replace missing values with 99 in 4 specific variables:
    imputer = ArbitraryNumberImputer(
        arbitrary_number=99,
        variables=["A2", "A3", "A8", "A11"],
    )

Note

ArbitraryNumberImputer() will automatically select all numerical variables in the train set for imputation if we set the variables parameter to None.

  11. Finally, let’s replace the missing values with 99:
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)

Note

To impute different variables with different numbers, set up ArbitraryNumberImputer() as follows: ArbitraryNumberImputer(imputer_dict={"A2": -1, "A3": -1, "A8": 999, "A11": 9999}).

We have now replaced missing data with arbitrary numbers using three different open-source libraries.

How it works...

In this recipe, we replaced missing values in numerical variables with an arbitrary number using pandas, scikit-learn, and feature-engine.

To determine which arbitrary value to use, we inspected the maximum values of four numerical variables using pandas’ max(). We chose 99 because it was greater than the maximum values of the selected variables. In step 6, we used pandas’ fillna() to replace the missing data.

To replace missing values using scikit-learn, we utilized SimpleImputer(), with the strategy set to constant, and specified 99 in the fill_value argument. Next, we fitted the imputer to a slice of the train set with the numerical variables to impute. Finally, we replaced missing values using transform().

To replace missing values with feature-engine, we used ArbitraryNumberImputer(), specifying the value 99 and the variables to impute as parameters. Next, we applied the fit_transform() method to replace missing data in the train set and the transform() method to replace missing data in the test set.

