When a time series exhibits an upward or downward movement in the long run, it is said to have a general trend. A quick way to check for a general trend is to plot the time series, as in the following figure, which shows CO2 concentrations in air measured from 1974 through 1987:
Figure 1.5: Time series of CO2 readings with an upward trend
However, a general trend might not be evident over a short run of the series. Short-run effects such as seasonal fluctuations and irregular variations cause the time series to revisit lower or higher values observed in the past and hence can temporarily obscure any general trend. This is evident in the same time series of CO2 concentrations when zoomed in over the period of 1980 through 1981, as shown in the following figure. Hence, to reveal a general trend, we need a time series that extends substantially back into the past.
Figure 1.6: A shorter run of the CO2 readings time series, which does not reveal the general trend
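The effect of the window length can be sketched on synthetic data. The series below is not the CO2 dataset: it is a hypothetical monthly series with an assumed upward trend of 0.1 per month and an assumed seasonal swing of amplitude 3. A straight line fitted to the full 14 years recovers a slope close to the true trend, while a line fitted to a few months on the falling side of one seasonal cycle suggests a decline.

```python
import numpy as np

# Hypothetical monthly series: 14 years, linear trend plus a seasonal sine wave.
months = np.arange(14 * 12)
trend = 330.0 + 0.1 * months
seasonal = 3.0 * np.sin(2 * np.pi * months / 12)
series = trend + seasonal

# Slope of a straight line fitted to the whole series.
full_slope = np.polyfit(months, series, 1)[0]

# Slope fitted to a seven-month window on the falling side of one cycle.
window = slice(3, 10)
short_slope = np.polyfit(months[window], series[window], 1)[0]

print('Slope over the full series:', round(full_slope, 3))
print('Slope over the short window:', round(short_slope, 3))
```

The short-window slope is negative even though the underlying trend is upward, which is the masking effect described above.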
The general trend in a time series is due to fundamental shifts or systemic changes in the process or system it represents. For example, the upward movement of CO2 concentrations from 1974 through 1987 can be attributed to the gradual rise in the number of automobiles and in industrialization over these years.
A general trend is commonly modeled by setting up the time series as a regression against time and other known factors as explanatory variables. The regression or trend line can then be used as a prediction of the long-run movement of the time series. The residuals left by the trend line are further analyzed for other interesting properties such as seasonality, cyclical behavior, and irregular variations.
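As a minimal sketch of such a regression, the following fits a trend line on a time index together with one additional known factor, using ordinary least squares from NumPy. The data, the 0/1 factor, and the coefficients 2.0, 0.5, and 3.0 are all hypothetical and chosen only for illustration; the series is noise-free so that the recovered coefficients can be checked directly.

```python
import numpy as np

# Hypothetical, noise-free data: 48 time steps and one extra 0/1 factor.
t = np.arange(48, dtype=float)
factor = (np.arange(48) % 4 == 0).astype(float)
y = 2.0 + 0.5 * t + 3.0 * factor

# Design matrix with columns: intercept, time index, known factor.
X = np.column_stack([np.ones_like(t), t, factor])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# The fitted trend line and the residuals left over after removing it.
trend_line = X @ coef
residuals = y - trend_line

print('Intercept, time coefficient, factor coefficient:', coef)
print('Largest absolute residual:', np.abs(residuals).max())
```

With real data the residuals would not vanish, and it is those residuals that are examined for seasonality, cyclical behavior, and irregular variations.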
Now, let us go through the code that generated the preceding plots on CO2 concentrations. We will also show how to build a trend model using linear regression with the time index (which in this case is the row index of each monthly observation in the data) as the explanatory variable and the CO2 concentration as the dependent variable. But first, let us load the data into a pandas.DataFrame.
The data for this example is in the Excel file Monthly_CO2_Concentrations.xlsx under the datasets folder of the GitHub repo.
We start by importing the required packages as follows:
from __future__ import print_function
import os
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
# Raw string, so the backslash in the Windows path is not treated as an escape character
os.chdir(r'D:\Practical Time Series')
data = pd.read_excel('datasets/Monthly_CO2_Concentrations.xlsx', converters={'Year': np.int32, 'Month': np.int32})
data.head()
We have passed the argument converters to the read_excel function in order to make sure that columns Year and Month are assigned the integer (np.int32) datatype. The preceding lines of code will generate the following table:
|   | CO2 | Year | Month |
|---|--------|------|-------|
| 0 | 333.13 | 1974 | 5 |
| 1 | 332.09 | 1974 | 6 |
| 2 | 331.10 | 1974 | 7 |
| 3 | 329.14 | 1974 | 8 |
| 4 | 327.36 | 1974 | 9 |
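As an alternative to converters, the same integer typing can be sketched as a cast after loading. This assumes that Year and Month contain no missing values at that point, because NaN cannot be cast to an integer type. The small frame is hand-built from the first rows of the table above, and storing Year and Month as floats is only an assumption about how a spreadsheet loader might return them.

```python
import numpy as np
import pandas as pd

# First rows of the table above; Year and Month are deliberately stored
# as floats here (an assumption made for illustration).
raw = pd.DataFrame({
    'CO2': [333.13, 332.09, 331.10],
    'Year': [1974.0, 1974.0, 1974.0],
    'Month': [5.0, 6.0, 7.0],
})

# Cast the two columns to 32-bit integers; CO2 stays floating point.
typed = raw.astype({'Year': np.int32, 'Month': np.int32})
print(typed.dtypes)
```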
Before plotting, we must remove all rows having missing values. We also sort the DataFrame in ascending order of Year and Month. These steps are done as follows:
# .ix has been removed from pandas; .loc performs the same boolean row selection
data = data.loc[(~pd.isnull(data['CO2']))&\
(~pd.isnull(data['Year']))&\
(~pd.isnull(data['Month']))]
data.sort_values(['Year', 'Month'], inplace=True)
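The same cleaning step can also be sketched with dropna and a method chain. The small frame below is hypothetical. The call to reset_index(drop=True) is an addition that is not part of the code above: it renumbers the rows 0, 1, 2, and so on in time order, which may matter if the row index is later used as the time index of a regression. Whether the original data needs this depends on how the file is ordered.

```python
import numpy as np
import pandas as pd

# Hypothetical unsorted frame with one missing CO2 reading.
df = pd.DataFrame({
    'CO2': [331.10, np.nan, 333.13, 332.09],
    'Year': [1974, 1974, 1974, 1974],
    'Month': [7, 8, 5, 6],
})

clean = (
    df.dropna(subset=['CO2', 'Year', 'Month'])  # drop rows with any missing value
      .sort_values(['Year', 'Month'])           # chronological order
      .reset_index(drop=True)                   # consecutive index 0, 1, 2, ...
)
print(clean)
```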
Finally, the plot for the time period 1974 to 1987 is generated by executing the following lines:
plt.figure(figsize=(5.5, 5.5))
data['CO2'].plot(color='b')
plt.title('Monthly CO2 concentrations')
plt.xlabel('Time')
plt.ylabel('CO2 concentration')
plt.xticks(rotation=30)
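The plot above uses the integer row index on the x-axis. If calendar dates are preferred, one option is to build a date index from the Year and Month columns, sketched here on the first rows of the table shown earlier. Placing each reading on the first day of its month is an assumption made for illustration, and the name co2_by_date is hypothetical.

```python
import pandas as pd

# First rows of the table shown earlier.
df = pd.DataFrame({
    'CO2': [333.13, 332.09, 331.10],
    'Year': [1974, 1974, 1974],
    'Month': [5, 6, 7],
})

# pd.to_datetime assembles dates from columns named year, month and day.
dates = pd.to_datetime(pd.DataFrame({
    'year': df['Year'],
    'month': df['Month'],
    'day': 1,  # assumption: place each reading on the first of the month
}))

co2_by_date = pd.Series(df['CO2'].to_numpy(), index=pd.DatetimeIndex(dates), name='CO2')
print(co2_by_date)
# co2_by_date.plot(color='b') would then label the x-axis with dates.
```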
The zoomed-in version of the data for the time period 1980 to 1981 is generated after filtering the DataFrame for these two years:
plt.figure(figsize=(5.5, 5.5))
data['CO2'].loc[(data['Year']==1980) | (data['Year']==1981)].plot(color='b')
plt.title('Monthly CO2 concentrations')
plt.xlabel('Time')
plt.ylabel('CO2 concentration')
plt.xticks(rotation=30)
Next, let us fit the trend line. For this we import the LinearRegression class from scikit-learn and fit a linear model on the time index:
from sklearn.linear_model import LinearRegression
# The normalize argument was removed in scikit-learn 1.2; ordinary least squares gives the same fit without it
trend_model = LinearRegression(fit_intercept=True)
trend_model.fit(np.array(data.index).reshape((-1,1)), data['CO2'])
print('Trend model coefficient={} and intercept={}'.format(trend_model.coef_[0],
trend_model.intercept_)
)
This produces the following output:
Trend model coefficient=0.111822078545 and intercept=329.455422234
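To read these numbers, the short sketch below converts the fitted slope to a per-year figure and evaluates the trend line at a given index. It assumes that consecutive index values are one month apart, which holds only if no months are missing from the data; the helper trend_value is hypothetical and simply restates the fitted line.

```python
# Coefficient and intercept copied from the output above.
coef = 0.111822078545
intercept = 329.455422234

# Assumption: one index step equals one month, so twelve steps make a year.
rise_per_year = coef * 12

def trend_value(index):
    """Value of the fitted trend line at a given time index."""
    return intercept + coef * index

print('Trend rise per year:', rise_per_year)
print('Trend value at index 0:', trend_value(0))
print('Trend value at index 12:', trend_value(12))
```

Under that assumption the trend line rises by roughly 1.34 units of the CO2 column per year.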
The residuals obtained from the trend line model are shown in the following figure and appear to have seasonal behavior, which is discussed in the next subsection.
The residuals are calculated and plotted by the following lines of code:
residuals = np.array(data['CO2']) - trend_model.predict(np.array(data.index).reshape((-1,1)))
plt.figure(figsize=(5.5, 5.5))
pd.Series(data=residuals, index=data.index).plot(color='b')
plt.title('Residuals of trend model for CO2 concentrations')
plt.xlabel('Time')
plt.ylabel('CO2 concentration')
plt.xticks(rotation=30)
Figure 1.7: Residuals from a linear model of the general trend in CO2 readings
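One quick, informal way to check whether residuals like these carry a seasonal pattern is to average them by calendar month. The sketch below does this on a synthetic series rather than the CO2 data: the trend of 0.1 per month, the seasonal amplitude of 3, and the mapping of index to month number are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: 14 years of linear trend plus a seasonal sine wave.
n = 14 * 12
t = np.arange(n)
month = t % 12 + 1  # month numbers 1 to 12 (assumed mapping)
co2_like = 330.0 + 0.1 * t + 3.0 * np.sin(2 * np.pi * t / 12)

# Fit and remove a straight-line trend.
slope, intercept = np.polyfit(t, co2_like, 1)
residuals = co2_like - (intercept + slope * t)

# Average residual for each calendar month.
monthly_mean = pd.Series(residuals).groupby(month).mean()
print(monthly_mean.round(2))
```

Monthly averages that differ systematically from one another, as they do here by construction, suggest that the trend line has left a seasonal component in the residuals.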