Dealing with heteroskedasticity
In this recipe, we delve into the variance of time series. The variance of a time series is a measure of how spread out the data is and how this dispersion evolves over time. You’ll learn how to handle data with a changing variance.
Getting ready
The variance of a time series can change over time, which also violates stationarity. In such cases, the time series is referred to as heteroskedastic and usually shows a long-tailed distribution, meaning the data is left- or right-skewed. This condition is problematic because it impacts the training of neural networks and other models.
How to do it…
Dealing with non-constant variance is a two-step process. First, we use statistical tests to check whether a time series is heteroskedastic. Then, we use transformations such as the logarithm to stabilize the variance.
We can detect heteroskedasticity using statistical tests such as the White test or the Breusch-Pagan test. The following code implements these tests based on the statsmodels library:
import statsmodels.stats.api as sms
from statsmodels.formula.api import ols

series_df = series_daily.reset_index(drop=True).reset_index()
series_df.columns = ['time', 'value']
series_df['time'] += 1

olsr = ols('value ~ time', series_df).fit()

_, pval_white, _, _ = sms.het_white(olsr.resid, olsr.model.exog)
_, pval_bp, _, _ = sms.het_breuschpagan(olsr.resid, olsr.model.exog)
The preceding code follows these steps:
- Import the ols and stats modules from statsmodels.
- Create a DataFrame based on the values of the time series and the row at which they were collected (1 for the first observation).
- Create a linear model that relates the values of the time series with the time column.
- Run het_white (White) and het_breuschpagan (Breusch-Pagan) to apply the variance tests.
The output of each test is a p-value, where the null hypothesis posits that the time series has constant variance. So, if the p-value is below the significance level, we reject the null hypothesis and assume the series is heteroskedastic.
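For example, here's a minimal sketch of how the p-values can be interpreted, assuming a significance level of 0.05:

alpha = 0.05  # assumed significance level

# reject the null hypothesis of constant variance if either test is significant
is_heteroskedastic = (pval_white < alpha) or (pval_bp < alpha)
print(f'Heteroskedastic: {is_heteroskedastic}')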
The simplest way to deal with non-constant variance is by transforming the data using the logarithm. This operation can be implemented as follows:
import numpy as np

class LogTransformation:
    @staticmethod
    def transform(x):
        xt = np.sign(x) * np.log(np.abs(x) + 1)
        return xt

    @staticmethod
    def inverse_transform(xt):
        x = np.sign(xt) * (np.exp(np.abs(xt)) - 1)
        return x
The preceding code is a Python class called LogTransformation. It contains two methods: transform() and inverse_transform(). The first transforms the data using the logarithm, and the second reverts that operation.
We apply the transform() method to the time series as follows:
series_log = LogTransformation.transform(series_daily)
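As a quick sanity check (a sketch assuming series_daily is a numeric pandas Series), the operation can be reverted with inverse_transform():

# revert the log transformation; the result should match series_daily up to floating-point error
series_back = LogTransformation.inverse_transform(series_log)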
The log is a particular case of the Box-Cox transformation, which is available in the scipy library. Note that stats.boxcox() requires strictly positive data. You can apply this method as follows:
from scipy import stats

series_transformed, lmbda = stats.boxcox(series_daily)
The stats.boxcox() method estimates a transformation parameter, lmbda, which can be used to revert the operation.
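For instance, here's a minimal sketch of how to revert the transformation with scipy.special.inv_boxcox, assuming series_transformed and lmbda come from the previous step:

from scipy.special import inv_boxcox

# invert the Box-Cox transformation using the estimated lmbda parameter
series_restored = inv_boxcox(series_transformed, lmbda)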
How it works…
The transformations outlined in this recipe stabilize the variance of a time series. They also bring the data distribution closer to the Normal distribution. These transformations are especially useful for neural networks as they help avoid saturation areas. In neural networks, saturation occurs when the model becomes insensitive to different inputs, thus compromising the training process.
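As an optional check (a sketch assuming series_daily is available from earlier in the recipe), you can compare the skewness of the data before and after the log transformation; a value closer to zero indicates a more symmetric, more Normal-like distribution:

from scipy import stats

print(stats.skew(series_daily))
print(stats.skew(LogTransformation.transform(series_daily)))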
There’s more…
The Yeo-Johnson power transformation is similar to the Box-Cox transformation but allows for negative values in the time series. You can learn more about this method at the following link: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.yeojohnson.html.
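As a brief sketch, the Yeo-Johnson transformation can be applied in much the same way as Box-Cox; when no parameter is supplied, the estimated lmbda is returned alongside the transformed series:

from scipy import stats

# apply the Yeo-Johnson transformation, which also works with negative values
series_yj, lmbda_yj = stats.yeojohnson(series_daily)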
See also
You can learn more about the importance of the logarithm transformation in the following reference:
Bandara, Kasun, Christoph Bergmeir, and Slawek Smyl. “Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach.” Expert Systems with Applications 140 (2020): 112896.