Python Feature Engineering Cookbook: A complete guide to crafting powerful features for your machine learning models

Product type Paperback

Published in Aug 2024

Publisher Packt

ISBN-13 9781835883587

Length 396 pages

Edition 3rd Edition

Languages

Python

Tools

Combine

Concepts

Data Science

Author (1):

Soledad Galli

Carrying out interpolation

We can impute missing data in a time series by interpolating between two non-missing data points. Interpolation is the estimation of one or more values within a range by means of a function. In linear interpolation, we fit a straight line between the last observed value and the next valid point. In spline interpolation, we fit a low-degree polynomial between the last and next observed values. The idea is to obtain estimates that follow the local shape of the series more closely than simply repeating a neighboring value, as forward or backward fill does.
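To make the idea of linear interpolation concrete, here is a minimal sketch of the underlying formula. The helper name linear_interpolate and the passenger counts are made up for illustration and are not part of the recipe:

```python
def linear_interpolate(x, x0, y0, x1, y1):
    """Return the value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical gap: 112 observed at position 0 and 132 observed at position 4.
# Positions 1, 2, and 3 are missing and get values on the connecting line.
estimates = [linear_interpolate(x, 0, 112.0, 4, 132.0) for x in (1, 2, 3)]
print(estimates)  # [117.0, 122.0, 127.0]
```

Each estimate moves a constant step of 5 toward the next observed value, which is what a straight line between the two valid points implies.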

In this recipe, we’ll carry out linear and spline interpolation in a time series.

How to do it...

Let’s begin by importing the required libraries and time series dataset.

Let’s import pandas and matplotlib:

```
import matplotlib.pyplot as plt
import pandas as pd
```

Let’s load the time series data described in the Technical requirements section:

```
df = pd.read_csv(
    "air_passengers.csv",
    parse_dates=["ds"],
    index_col=["ds"],
)
```

Note

You can plot the time series to find data gaps as we did in step 3 of the Implementing forward and backward fill recipe.
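Besides plotting, you can locate the gaps numerically. The following sketch uses a small toy DataFrame in place of the recipe's dataset; the column name passengers and all values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Toy monthly series standing in for the recipe's data (values are made up).
toy = pd.DataFrame(
    {"passengers": [112.0, 118.0, np.nan, np.nan, 121.0, 135.0, np.nan, 148.0]},
    index=pd.date_range("1949-01-01", periods=8, freq="MS", name="ds"),
)

# Count the missing values and list the timestamps where they occur.
n_missing = toy["passengers"].isnull().sum()
gap_dates = toy[toy["passengers"].isnull()].index
print(n_missing)
print(list(gap_dates.strftime("%Y-%m")))
```

Knowing where the gaps sit, and how long they are, helps you judge whether interpolation is a reasonable choice for them.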

Let’s impute the missing data by linear interpolation:
```
df_imputed = df.interpolate(method="linear")
```

Note

If the time intervals between rows are not uniform, set method to time so that the linear fit accounts for the actual time elapsed between observations.
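The difference between the two methods shows up as soon as the timestamps are unevenly spaced. The sketch below uses a made-up three-point series, not the recipe's data, in which the missing value sits one day after the first observation and nine days before the next:

```python
import numpy as np
import pandas as pd

# Irregular timestamps: a 1-day step followed by a 9-day step (toy data).
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-11"])
s = pd.Series([0.0, np.nan, 10.0], index=idx)

by_position = s.interpolate(method="linear")  # ignores the timestamps
by_time = s.interpolate(method="time")        # weights by elapsed time

print(by_position.iloc[1])  # halfway between 0 and 10
print(by_time.iloc[1])      # about one tenth of the way from 0 to 10
```

With method set to linear, the missing value lands in the middle of its two neighbors. With method set to time, it stays close to the first observation because only one of the ten days in the gap has elapsed.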

You can verify the absence of missing data by executing df_imputed.isnull().sum().
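This check is worth running because, with the default arguments as we understand them, interpolate() fills in the forward direction only, so missing values at the very start of a series are left untouched, while trailing ones are filled with the last valid value. A minimal sketch with toy values:

```python
import numpy as np
import pandas as pd

# Toy series with a leading, an interior, and a trailing missing value.
s = pd.Series([np.nan, 5.0, np.nan, 9.0, np.nan])

default_fill = s.interpolate(method="linear")
# limit_area="inside" restricts filling to NaN surrounded by valid values.
inside_only = s.interpolate(method="linear", limit_area="inside")

print(default_fill.isnull().sum())
print(inside_only.isnull().sum())
```

If the count is not zero after interpolation, the remaining gaps are at the edges of the series and need a different treatment, such as backward fill.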

Let’s now plot the complete dataset and overlay as a dotted line the values used for the imputation:
```
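# Plot the full imputed series as a continuous line.
ax = df_imputed.plot(
    linestyle="-", marker=".", figsize=[10, 5])
# Overlay in red only the values that were NaN in the original df.
df_imputed[df.isnull()].plot(
    ax=ax, legend=False, marker=".", color="r")
ax.set_title("Air passengers")
ax.set_ylabel("Number of passengers")
ax.set_xlabel("Time")
plt.show()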
```
The previous code produces the following plot, in which the values that replaced NaN appear as dotted segments within the continuous line of the time series:

Figure 1.8 – Time series data where missing values were replaced by linear interpolation between the last and next valid data points (dotted line)
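The overlay relies on boolean masking: indexing the imputed DataFrame with df.isnull() keeps only the positions that were originally missing and sets everything else to NaN. A small sketch with a toy DataFrame (the column name y is an assumption) shows the effect:

```python
import numpy as np
import pandas as pd

# Toy data with a two-row gap.
original = pd.DataFrame({"y": [1.0, np.nan, np.nan, 4.0, 5.0]})
filled = original.interpolate(method="linear")

# Keep the imputed values only; originally observed rows become NaN.
only_imputed = filled[original.isnull()]
print(only_imputed["y"].tolist())  # [nan, 2.0, 3.0, nan, nan]
```

Because the observed rows are NaN in the masked DataFrame, only the imputed stretch is drawn when we plot it on top of the full series.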

Alternatively, we can impute missing data by doing spline interpolation. We’ll use a polynomial of the second degree:
```
df_imputed = df.interpolate(method="spline", order=2)
```
If we plot the imputed dataset and overlay the imputation values as we did for the linear interpolation, we’ll see the following plot:

Figure 1.9 – Time series data where missing values were replaced by fitting a second-degree polynomial between the last and next valid data points (dotted line)

Note

Change the degree of the polynomial used in the interpolation to see how the replacement values vary.
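One way to run this experiment is to loop over several orders and compare the values that fill the gap. The sketch below uses toy data that lies on an exact parabola, not the recipe's dataset, and assumes SciPy is installed, since pandas needs it for the spline method:

```python
import numpy as np
import pandas as pd

# Toy data on the parabola y = x**2, with two values removed.
x = np.arange(8, dtype=float)
s = pd.Series(x**2)
s.iloc[[3, 4]] = np.nan

# Fill the gap with splines of increasing degree and collect the results.
results = {}
for order in (1, 2, 3):
    filled = s.interpolate(method="spline", order=order)
    results[order] = filled.iloc[[3, 4]].tolist()
    print(order, np.round(results[order], 2))
```

Because the toy data is exactly quadratic, order 2 should recover values very close to 9 and 16, while order 1 cannot follow the curvature as well. On real data there is no such ground truth, so compare the plots instead.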

We’ve now obtained complete datasets that we can use for analysis and modeling.

How it works...

pandas interpolate() fills missing values in a range by using an interpolation method. When we set method to linear, interpolate() ignores the index, treats all data points as equidistant, and fits a line between the last and next valid points in each interval with missing data.
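A quick way to see that the index plays no role is to reproduce the result by hand with NumPy, interpolating over row positions. This is a sketch on toy data, not a description of pandas internals:

```python
import numpy as np
import pandas as pd

# Toy series with unevenly spaced dates and a two-row gap.
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-11", "2024-01-12"])
s = pd.Series([0.0, np.nan, np.nan, 9.0], index=idx)

# Interpolate over row positions 0, 1, 2, 3 rather than over the dates.
positions = np.arange(len(s))
known = s.notna().to_numpy()
manual = np.interp(positions, positions[known], s.to_numpy()[known])

pandas_linear = s.interpolate(method="linear").to_numpy()
print(manual)         # [0. 3. 6. 9.]
print(pandas_linear)  # [0. 3. 6. 9.]
```

Both approaches give evenly stepped values, even though the dates are far from evenly spaced.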

Note

If you want to perform linear interpolation, but your data points are not equally spaced in time, set method to time.

We then performed spline interpolation with a second-degree polynomial by setting method to spline and order to 2.

pandas interpolate() uses scipy.interpolate.interp1d and scipy.interpolate.UnivariateSpline under the hood for methods such as polynomial and spline, so SciPy must be installed to use them. This also means it can implement other interpolation methods. Check out the pandas documentation for more details at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html.
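As an illustration of those other methods, the following sketch tries three of the options listed in the pandas documentation on toy data. It assumes SciPy is installed; the values are made up and lie on the parabola y = x**2:

```python
import numpy as np
import pandas as pd

# Toy data on y = x**2 with a two-row gap at positions 2 and 3.
x = np.arange(6, dtype=float)
s = pd.Series(x**2)
s.iloc[[2, 3]] = np.nan

# Copy the closest observed value.
nearest = s.interpolate(method="nearest")
# Second-order interpolation, requested in two equivalent-looking ways.
quadratic = s.interpolate(method="quadratic")
poly2 = s.interpolate(method="polynomial", order=2)

print(nearest.iloc[[2, 3]].tolist())
print(quadratic.iloc[[2, 3]].round(2).tolist())
print(poly2.iloc[[2, 3]].round(2).tolist())
```

The nearest method copies the neighboring observations into the gap, whereas the second-order methods follow the curvature of the data. Which one suits your series depends on how smoothly you expect it to behave between observations.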