You're reading from Hyperparameter Tuning with Python

Product type Book

Published in Jul 2022

Publisher Packt

ISBN-13 9781803235875

Pages 306 pages

Edition 1st Edition

Languages

Concepts

Machine Learning

Author (1):

Louis Owen

Table of Contents (19) Chapters

Preface

1. Section 1:The Methods

2. Chapter 1: Evaluating Machine Learning Models

3. Chapter 2: Introducing Hyperparameter Tuning

4. Chapter 3: Exploring Exhaustive Search

5. Chapter 4: Exploring Bayesian Optimization

6. Chapter 5: Exploring Heuristic Search

7. Chapter 6: Exploring Multi-Fidelity Optimization

8. Section 2:The Implementation

9. Chapter 7: Hyperparameter Tuning via Scikit

10. Chapter 8: Hyperparameter Tuning via Hyperopt

11. Chapter 9: Hyperparameter Tuning via Optuna

12. Chapter 10: Advanced Hyperparameter Tuning with DEAP and Microsoft NNI

13. Section 3:Putting Things into Practice

14. Chapter 11: Understanding the Hyperparameters of Popular Algorithms

15. Chapter 12: Introducing Hyperparameter Tuning Decision Map

16. Chapter 13: Tracking Hyperparameter Tuning Experiments

17. Chapter 14: Conclusions and Next Steps

18. Other Books You May Enjoy

Discovering time-series cross-validation

Time-series data has a characteristic that sets it apart. Unlike "normal" data, which is assumed to be independent and identically distributed (IID), each time-series sample depends on the samples that precede it, so changing the order of the samples changes the interpretation of the data.

Examples of time-series data include the following:

Daily stock market prices
Hourly temperature data
Minute-by-minute web page click counts

Applying the previous cross-validation strategies (for example, k-fold, random, or stratified splits) to time-series data introduces look-ahead bias. Look-ahead bias occurs when a model uses future values that would not yet be available at the point in time being simulated.

For instance, suppose we are working with hourly temperature data and want to predict the temperature 2 hours from now. If the model is trained on the value from the next hour or from 3 hours ahead, it is using information that is not yet available. This bias arises easily with the previous cross-validation strategies, because they are designed to work well only on IID data.
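The look-ahead problem can be made concrete by counting folds in which the training set contains observations that come later than the earliest validation observation. The following is a minimal sketch rather than a definitive test; the helper name count_leaky_folds and the 48-sample series are hypothetical, and the position of each sample is assumed to stand in for its timestamp.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit


def count_leaky_folds(splitter, n_samples):
    """Count folds whose training set contains a time index later than
    the earliest validation index (the model would 'see the future')."""
    timestamps = np.arange(n_samples)  # assumption: position == time order
    leaky = 0
    for train_idx, val_idx in splitter.split(timestamps):
        if train_idx.max() > val_idx.min():
            leaky += 1
    return leaky


n_samples = 48  # hypothetical: two days of hourly readings
kfold_leaks = count_leaky_folds(
    KFold(n_splits=5, shuffle=True, random_state=0), n_samples
)
tscv_leaks = count_leaky_folds(TimeSeriesSplit(n_splits=5), n_samples)
print("k-fold folds with look-ahead:", kfold_leaks)
print("time-series folds with look-ahead:", tscv_leaks)
```

With k-fold, at most one fold can have a validation set made up entirely of the latest samples, so at least four of the five folds train on data from the future. The time-series splitter always validates on samples that come after the training samples.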

Time-series cross-validation is a cross-validation strategy designed specifically for time-series data. Like k-fold, it accepts a predefined number of folds, k, and generates k test sets. The difference is that the data is never shuffled, and the training set in each iteration is a superset of the one in the previous iteration, so the training set keeps growing as the iterations proceed. Once we finish cross-validation and settle on the final model configuration, we can evaluate the final model on the test data (see Figure 1.4):

Figure 1.4 – Time-series cross-validation
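The expanding training window can also be shown as text. The sketch below is only an illustration, assuming 12 samples and 3 splits; the helper render_splits is hypothetical and simply marks each position as train (T), validation (V), or unused (.) for every split produced by TimeSeriesSplit.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit


def render_splits(n_samples, n_splits):
    """Return one text row per split: 'T' = train, 'V' = validation, '.' = unused."""
    rows = []
    tscv = TimeSeriesSplit(n_splits=n_splits)
    for train_idx, val_idx in tscv.split(np.arange(n_samples)):
        row = ["."] * n_samples
        for i in train_idx:
            row[i] = "T"
        for i in val_idx:
            row[i] = "V"
        rows.append("".join(row))
    return rows


rows = render_splits(n_samples=12, n_splits=3)
for row in rows:
    print(row)
# TTTVVV......
# TTTTTTVVV...
# TTTTTTTTTVVV
```

Each row's training block contains the previous row's training block, and the validation block always sits immediately after it.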

Also, the Scikit-Learn package provides us with a nice implementation of this strategy:

from sklearn.model_selection import train_test_split, TimeSeriesSplit

# df is assumed to be a DataFrame whose rows are already in chronological order
# Hold out the most recent 20% as the final test set; shuffle=False preserves time order
df_cv, df_test = train_test_split(df, test_size=0.2, shuffle=False)
tscv = TimeSeriesSplit(n_splits=5)
for train_index, val_index in tscv.split(df_cv):
    df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
    # perform training or hyperparameter tuning here
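As one possible way to fill in the tuning step, the sketch below scores a few candidate values of a single hyperparameter with time-series cross-validation. Everything in it is an assumption made for illustration: the synthetic series, the two lag features, the Ridge model, and the candidate alpha values are not from the book, and the final hold-out test split is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical series: a slow upward trend plus noise.
rng = np.random.default_rng(0)
n_samples = 200
series = 0.05 * np.arange(n_samples) + rng.normal(scale=1.0, size=n_samples)

# Features are the two previous values (lag 1 and lag 2); the target is the current value.
X = np.column_stack([series[1:-1], series[:-2]])
y = series[2:]

tscv = TimeSeriesSplit(n_splits=5)
scores = {}
for alpha in [0.1, 1.0, 10.0]:  # hypothetical candidate values
    fold_errors = []
    for train_idx, val_idx in tscv.split(X):
        # Fit only on the past, evaluate on the block that follows it.
        model = Ridge(alpha=alpha).fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[val_idx])
        fold_errors.append(mean_squared_error(y[val_idx], predictions))
    scores[alpha] = float(np.mean(fold_errors))

best_alpha = min(scores, key=scores.get)
print("mean validation MSE per alpha:", scores)
print("selected alpha:", best_alpha)
```

Averaging the validation error over the five splits gives one score per candidate, and the candidate with the lowest score would then be refit and checked once on the held-out test data.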

Providing n_splits=5 ensures that five test sets are generated. By default, the train set for the ith fold has a size of i * (n_samples // (n_splits + 1)) + n_samples % (n_splits + 1), while each test set has a size of n_samples // (n_splits + 1), where n_samples is the number of samples.
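The default sizes can be checked empirically. The sketch below assumes 23 samples and 5 splits; the helper names expected_sizes and observed_sizes are hypothetical. It compares the sizes reported by TimeSeriesSplit with the arithmetic in which each test set gets n_samples // (n_splits + 1) samples and the surplus n_samples % (n_splits + 1) is added to the first train set.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit


def expected_sizes(n_samples, n_splits):
    """(train_size, test_size) per split under the default settings."""
    test_size = n_samples // (n_splits + 1)
    surplus = n_samples % (n_splits + 1)  # leftover samples go to the train sets
    return [(i * test_size + surplus, test_size) for i in range(1, n_splits + 1)]


def observed_sizes(n_samples, n_splits):
    """(train_size, test_size) per split as actually produced by TimeSeriesSplit."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    return [(len(tr), len(te)) for tr, te in tscv.split(np.arange(n_samples))]


n_samples, n_splits = 23, 5
print(expected_sizes(n_samples, n_splits))
# [(8, 3), (11, 3), (14, 3), (17, 3), (20, 3)]
print(observed_sizes(n_samples, n_splits) == expected_sizes(n_samples, n_splits))
# True
```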

However, you can change the train and test set sizes via the max_train_size and test_size arguments of the TimeSeriesSplit class. There is also a gap argument that excludes G samples from the end of each train set, where G is a value specified by the developer.
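A small example may help show how these three arguments interact. This is a sketch under stated assumptions: the values below (12 samples, 3 splits, at most 4 training samples, 2 test samples, a gap of 1) are arbitrary, and it requires a Scikit-Learn version in which TimeSeriesSplit accepts test_size and gap.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=3, max_train_size=4, test_size=2, gap=1)

splits = []
for train_idx, test_idx in tscv.split(np.arange(12)):
    splits.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
# train: [1, 2, 3, 4] test: [6, 7]
# train: [3, 4, 5, 6] test: [8, 9]
# train: [5, 6, 7, 8] test: [10, 11]
```

With max_train_size set, the training window slides instead of growing, and the gap leaves one sample unused between the end of each train set and the start of its test set.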

Be aware that the Scikit-Learn implementation always ensures that the test sets do not overlap, even though this is not strictly necessary. Currently, the Scikit-Learn implementation offers no way to allow overlapping test sets; to use that kind of strategy, you need to write the code from scratch.
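One way such a from-scratch splitter might look is sketched below. It is an illustrative assumption, not part of Scikit-Learn and not the author's method: the function overlapping_time_series_split and its parameters are hypothetical. It keeps an expanding training window and advances the validation window by step samples, so consecutive validation windows overlap whenever step is smaller than val_size.

```python
def overlapping_time_series_split(n_samples, initial_train_size, val_size, step):
    """Yield (train_indices, val_indices) with an expanding training window.

    Validation windows overlap whenever step < val_size.
    """
    if min(initial_train_size, val_size, step) < 1:
        raise ValueError("initial_train_size, val_size, and step must be >= 1")
    train_end = initial_train_size
    while train_end + val_size <= n_samples:
        train_indices = list(range(train_end))
        val_indices = list(range(train_end, train_end + val_size))
        yield train_indices, val_indices
        train_end += step  # grow the training window by `step` samples


splits = list(
    overlapping_time_series_split(
        n_samples=10, initial_train_size=4, val_size=3, step=1
    )
)
for train_indices, val_indices in splits:
    print("train up to", train_indices[-1], "validate on", val_indices)
# train up to 3 validate on [4, 5, 6]
# train up to 4 validate on [5, 6, 7]
# train up to 5 validate on [6, 7, 8]
# train up to 6 validate on [7, 8, 9]
```

The indices it yields can be used with df.iloc in the same way as those returned by TimeSeriesSplit.split.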

In this section, we learned about the unique characteristic of time-series data and how to perform a cross-validation strategy on it. There are other variations of the cross-validation strategy that haven't been covered in this book. If you are interested, you might find some pointers in the Further reading section.
