Python Feature Engineering Cookbook

A complete guide to crafting powerful features for your machine learning models

Product type Paperback
Published in Aug 2024
Publisher Packt
ISBN-13 9781835883587
Length 396 pages
Edition 3rd Edition
Author Soledad Galli


Carrying out interpolation

We can impute missing data in time series by interpolating between two non-missing data points. Interpolation is the estimation of one or more values in a range by means of a function. In linear interpolation, we fit a linear function between the last observed value and the next valid point. In spline interpolation, we fit a low-degree polynomial between the last and next observed values. The aim of interpolation is to obtain better estimates of the missing data than those returned by forward or backward fill.

In this recipe, we’ll carry out linear and spline interpolation in a time series.

How to do it...

Let’s begin by importing the required libraries and time series dataset.

  1. Let’s import pandas and matplotlib:
    import matplotlib.pyplot as plt
    import pandas as pd
  2. Let’s load the time series data described in the Technical requirements section:
    df = pd.read_csv(
        "air_passengers.csv",
        parse_dates=["ds"],
        index_col=["ds"],
    )

Note

You can plot the time series to find data gaps as we did in step 3 of the Implementing forward and backward fill recipe.

  3. Let’s impute the missing data by linear interpolation:
    df_imputed = df.interpolate(method="linear")

Note

If the time intervals between rows are not uniform, set method to "time" so that the linear fit takes the actual spacing between timestamps into account.

You can verify the absence of missing data by executing df_imputed.isnull().sum().

  4. Let’s now plot the complete dataset and overlay as a dotted line the values used for the imputation:
    ax = df_imputed.plot(
        linestyle="-", marker=".", figsize=[10, 5])
    df_imputed[df.isnull()].plot(
        ax=ax, legend=None, marker=".", color="r")
    ax.set_title("Air passengers")
    ax.set_ylabel("Number of passengers")
    ax.set_xlabel("Time")

    The previous code returns the following plot, where the values used to replace NaN appear as dotted lines between the continuous segments of the time series:

Figure 1.8 – Time series data where missing values were replaced by linear interpolation between the last and next valid data points (dotted line)

  5. Alternatively, we can impute missing data by doing spline interpolation. We’ll use a polynomial of the second degree:
    df_imputed = df.interpolate(method="spline", order=2)

    If we plot the imputed dataset and overlay the imputation values as we did in step 4, we’ll see the following plot:

Figure 1.9 – Time series data where missing values were replaced by fitting a second-degree polynomial between the last and next valid data points (dotted line)

Note

Change the degree of the polynomial used in the interpolation to see how the replacement values vary.

We’ve now obtained complete datasets that we can use for analysis and modeling.
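
If the CSV file is not at hand, the linear interpolation step can be tried on a small synthetic series. The following is a minimal sketch, not the recipe's dataset: the toy DataFrame, its column name y, and its values are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for air_passengers.csv:
# a monthly series with one two-point gap and one single-point gap.
index = pd.date_range("2020-01-01", periods=6, freq="MS")
toy = pd.DataFrame(
    {"y": [100.0, np.nan, np.nan, 130.0, np.nan, 150.0]},
    index=index,
)

# Linear interpolation between the last and next valid observations.
toy_linear = toy.interpolate(method="linear")

# Confirm that no missing values remain.
n_missing = int(toy_linear["y"].isnull().sum())

# Show only the rows that were originally missing.
filled_rows = toy_linear[toy["y"].isnull()]
print(n_missing)
print(filled_rows)
```

Because the gaps sit between 100 and 130, and between 130 and 150, the replacement values lie on the straight lines joining those observations.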

How it works...

pandas interpolate() fills missing values in a range by using an interpolation method. When we set the method to linear, interpolate() treats all data points as equidistant and fits a line between the last and next valid points in an interval with missing data.

Note

If you want to perform linear interpolation but your data points are not equally spaced in time, set method to "time".
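
The difference between the two settings is easiest to see on an irregularly spaced index. Here is a minimal sketch; the three-point series and its dates are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical series with uneven spacing: the gap sits 1 day after the
# first observation and 3 days before the last one.
index = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"])
s = pd.Series([0.0, np.nan, 4.0], index=index)

# method="linear" ignores the index and treats rows as equidistant,
# so the missing value lands halfway between 0 and 4.
by_position = s.interpolate(method="linear")

# method="time" weights by the elapsed time between timestamps,
# so the missing value lands a quarter of the way between 0 and 4.
by_time = s.interpolate(method="time")

print(by_position.iloc[1])
print(by_time.iloc[1])
```

Note that method="time" requires a DatetimeIndex.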

We then performed spline interpolation with a second-degree polynomial by setting method to spline and order to 2.
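
To get a feel for how a second-degree spline differs from a straight line, we can interpolate a gap in data that follows a curve. This is a sketch under stated assumptions: the series below is hypothetical (values of x squared on an integer index), and the spline method requires SciPy to be installed.

```python
import numpy as np
import pandas as pd

# Hypothetical curved data: y = x**2 for x = 0..6, with x = 3 missing.
s = pd.Series([0.0, 1.0, 4.0, np.nan, 16.0, 25.0, 36.0])

# A straight line between the neighbours (4 and 16) ignores the curvature.
linear_value = s.interpolate(method="linear").iloc[3]

# A second-degree spline can follow the curvature of the data.
spline_value = s.interpolate(method="spline", order=2).iloc[3]

print(linear_value)  # midpoint of 4 and 16
print(spline_value)  # close to the value of the underlying curve at x = 3
```

On curved data like this, the spline estimate sits closer to the underlying function than the linear one; on noisy data, a higher order can instead produce unwanted oscillations, so it is worth comparing a few values of order.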

pandas interpolate() uses SciPy functionality such as scipy.interpolate.interp1d and scipy.interpolate.UnivariateSpline under the hood, and can therefore implement other interpolation methods. See the pandas documentation for more details at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html.
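
As an illustration of those other methods, the following sketch applies a few SciPy-backed options to the same series and places the results side by side. The series is hypothetical, the selection of methods is only an example, and SciPy must be installed.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps on an integer index.
s = pd.Series([0.0, 1.0, np.nan, 9.0, 16.0, np.nan, 36.0, 49.0])

# A few of the method names accepted by pandas interpolate().
methods = ["slinear", "quadratic", "cubic", "pchip"]

# One column per method, so the replacement values can be compared.
comparison = pd.DataFrame(
    {method: s.interpolate(method=method) for method in methods}
)

# Show only the rows that were originally missing.
print(comparison.loc[s.isnull()])
```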

See also

While interpolation aims to get better estimates of the missing data than forward and backward fill, these estimates may still be inaccurate if the time series shows strong trend and seasonality. To obtain better estimates of the missing data in these types of time series, check out time series decomposition followed by interpolation in the Feature Engineering for Time Series Course at https://www.trainindata.com/p/feature-engineering-for-forecasting.
