Carrying out interpolation
We can impute missing data in time series by interpolating between two non-missing data points. Interpolation is the estimation of one or more values in a range by means of a function. In linear interpolation, we fit a linear function between the last observed value and the next valid point. In spline interpolation, we fit a low-degree polynomial between the last and next observed values. The idea is to obtain better estimates of the missing data than we would get by simply carrying the last or next observation forward or backward.
In this recipe, we’ll carry out linear and spline interpolation in a time series.
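To make the idea of linear interpolation concrete, here is a minimal sketch on a made-up four-point series (the values are illustrative and not taken from the air passengers data). It computes the fill values by hand from the straight-line formula and compares them with what pandas returns.

```python
import numpy as np
import pandas as pd

# Toy series with a two-point gap between 10.0 and 40.0 (illustrative values).
s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Linear interpolation by hand: y = y0 + (y1 - y0) * (x - x0) / (x1 - x0),
# where x is the row position and (x0, y0), (x1, y1) are the gap's neighbors.
x0, x1 = 0, 3
y0, y1 = s.iloc[x0], s.iloc[x1]
manual = [y0 + (y1 - y0) * (x - x0) / (x1 - x0) for x in range(len(s))]

# pandas should give the same values with method="linear".
filled = s.interpolate(method="linear")

print(manual)           # gap filled with 20.0 and 30.0
print(filled.tolist())
```

The two missing points land one third and two thirds of the way between the neighboring observations, which is all that linear interpolation does.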
How to do it...
Let’s begin by importing the required libraries and time series dataset.
- Let’s import pandas and matplotlib:
import matplotlib.pyplot as plt
import pandas as pd
- Let’s load the time series data described in the Technical requirements section:
df = pd.read_csv(
    "air_passengers.csv",
    parse_dates=["ds"],
    index_col=["ds"],
)
Note
You can plot the time series to find data gaps as we did in step 3 of the Implementing forward and backward fill recipe.
- Let’s impute the missing data by linear interpolation:
df_imputed = df.interpolate(method="linear")
Note
If the time intervals between rows are not uniform, then the method should be set to time to achieve a linear fit.
You can verify the absence of missing data by executing df_imputed.isnull().sum().
- Let’s now plot the complete dataset and overlay as a dotted line the values used for the imputation:
ax = df_imputed.plot(
    linestyle="-", marker=".", figsize=[10, 5])
df_imputed[df.isnull()].plot(
    ax=ax, legend=None, marker=".", color="r")
ax.set_title("Air passengers")
ax.set_ylabel("Number of passengers")
ax.set_xlabel("Time")
The previous code returns the following plot, where we see the values used to replace nan as dotted lines in between the continuous line of the time series:
Figure 1.8 – Time series data where missing values were replaced by linear interpolation between the last and next valid data points (dotted line)
- Alternatively, we can impute missing data by doing spline interpolation. We’ll use a polynomial of the second degree:
df_imputed = df.interpolate(method="spline", order=2)
If we plot the imputed dataset and overlay the imputation values as we did in step 4, we’ll see the following plot:
Figure 1.9 – Time series data where missing values were replaced by fitting a second-degree polynomial between the last and next valid data points (dotted line)
Note
Change the degree of the polynomial used in the interpolation to see how the replacement values vary.
We’ve now obtained complete datasets that we can use for analysis and modeling.
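The overlay in step 4 relies on indexing the imputed DataFrame with a Boolean mask taken from the original one. The following sketch, on a hypothetical five-month DataFrame with an assumed column name y, shows what that expression keeps; it leaves out the plotting so that it runs without a display.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly data with two missing months (values are made up).
idx = pd.date_range("2020-01-01", periods=5, freq="MS")
df = pd.DataFrame({"y": [100.0, np.nan, 120.0, np.nan, 140.0]}, index=idx)

df_imputed = df.interpolate(method="linear")

# Boolean-mask indexing keeps a value only where the original was missing;
# every other cell becomes NaN, so a plot shows just the imputed points.
only_imputed = df_imputed[df.isnull()]

print(only_imputed)
print(df_imputed.isnull().sum())  # no missing values remain after imputation
```

Because the observed rows become NaN in only_imputed, plotting it on the same axes draws markers only at the imputed timestamps.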
How it works...
pandas interpolate() fills missing values in a range by using an interpolation method. When we set the method to linear, interpolate() treats all data points as equidistant and fits a line between the last and next valid points in an interval with missing data.
Note
If you want to perform linear interpolation, but your data points are not equally spaced, set method to time. This option requires the data to have a datetime index.
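The difference between the two options is easiest to see on unevenly spaced timestamps. The sketch below uses a made-up three-point series (not the recipe's dataset) in which the missing value sits one day after the first observation and nine days before the next.

```python
import numpy as np
import pandas as pd

# Made-up series with uneven spacing between timestamps.
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-11"])
s = pd.Series([0.0, np.nan, 10.0], index=idx)

by_position = s.interpolate(method="linear")  # ignores the index
by_time = s.interpolate(method="time")        # weights by elapsed time

print(by_position.iloc[1])  # 5.0, the midpoint of the two neighbors
print(by_time.iloc[1])      # about 1.0, one tenth of the way from 0 to 10
```

With method set to linear, the gap is filled as if it were halfway between its neighbors; with time, the fill value reflects that only one of the ten elapsed days has passed.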
We then performed spline interpolation with a second-degree polynomial by setting method to spline and order to 2.
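To see how the order parameter changes the replacement values, the following sketch fills the same gap with splines of order 1, 2, and 3. It assumes SciPy is installed, since pandas needs it for the spline method, and it uses a synthetic curved series rather than the air passengers data.

```python
import numpy as np
import pandas as pd

# Synthetic curved series (squares of 0..9) with two values removed.
s = pd.Series(np.arange(10, dtype=float) ** 2)
s.iloc[[4, 5]] = np.nan

# method="spline" requires SciPy and an explicit order.
results = {}
for order in (1, 2, 3):
    results[order] = s.interpolate(method="spline", order=order)
    print(order, results[order].iloc[[4, 5]].round(2).tolist())
```

Since the toy data follow a quadratic curve, the second-order fit should land close to the removed values of 16 and 25, while the other orders may differ.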
pandas interpolate() passes several methods to scipy.interpolate.interp1d and the spline method to scipy.interpolate.UnivariateSpline under the hood, and can therefore implement other interpolation methods. Check out the pandas documentation for more details at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html.
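As a rough illustration of those other methods, the sketch below fills the same gap with a few of the method names listed in the pandas documentation. The series is synthetic, and all methods except linear assume that SciPy is available.

```python
import numpy as np
import pandas as pd

# Synthetic series following y = x**2, with the values at x = 2 and x = 3 missing.
s = pd.Series([0.0, 1.0, np.nan, np.nan, 16.0, 25.0, 36.0])

# "nearest", "quadratic", and "cubic" are handled through SciPy by pandas.
results = {}
for method in ("linear", "nearest", "quadratic", "cubic"):
    results[method] = s.interpolate(method=method)
    print(method, results[method].iloc[[2, 3]].round(2).tolist())
```

Comparing the printed values side by side is a quick way to judge which method suits the shape of a given series before applying it to real data.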
See also
While interpolation aims to get better estimates of the missing data compared to forward and backward fill, these estimates may still not be accurate if the time series shows strong trend and seasonality. To obtain better estimates of the missing data in these types of time series, check out time series decomposition followed by interpolation in the Feature Engineering for Time Series Course at https://www.trainindata.com/p/feature-engineering-for-forecasting.