Hyperparameter Tuning with Python


Product type Book
Published in Jul 2022
Publisher Packt
ISBN-13 9781803235875
Pages 306 pages
Edition 1st Edition
Author Louis Owen
Table of Contents (19 chapters)

Preface
1. Section 1: The Methods
2. Chapter 1: Evaluating Machine Learning Models
3. Chapter 2: Introducing Hyperparameter Tuning
4. Chapter 3: Exploring Exhaustive Search
5. Chapter 4: Exploring Bayesian Optimization
6. Chapter 5: Exploring Heuristic Search
7. Chapter 6: Exploring Multi-Fidelity Optimization
8. Section 2: The Implementation
9. Chapter 7: Hyperparameter Tuning via Scikit
10. Chapter 8: Hyperparameter Tuning via Hyperopt
11. Chapter 9: Hyperparameter Tuning via Optuna
12. Chapter 10: Advanced Hyperparameter Tuning with DEAP and Microsoft NNI
13. Section 3: Putting Things into Practice
14. Chapter 11: Understanding the Hyperparameters of Popular Algorithms
15. Chapter 12: Introducing Hyperparameter Tuning Decision Map
16. Chapter 13: Tracking Hyperparameter Tuning Experiments
17. Chapter 14: Conclusions and Next Steps
18. Other Books You May Enjoy

Discovering time-series cross-validation

Time-series data has a unique characteristic: its samples are ordered in time. Unlike "normal" data, which is assumed to be independent and identically distributed (IID), time-series data does not follow that assumption. Each sample depends on the samples that came before it, so changing the order of the samples changes how the data is interpreted.

Several examples of time-series data are listed as follows:

  • Daily stock market price
  • Hourly temperature data
  • Minute-by-minute web page clicks count

If we apply the previous cross-validation strategies (for example, k-fold, random, or stratified splits) to time-series data, we introduce look-ahead bias. Look-ahead bias happens when we use future values of the data that would not yet be available at the current time of the simulation.

For instance, suppose we are working with hourly temperature data. We want to predict what the temperature will be in 2 hours, but we use the temperature value of the next hour or the next 3 hours, which would not be available yet. This kind of bias arises easily if we apply the previous cross-validation strategies, since those strategies are designed to work well only on IID data.
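The look-ahead problem can be made concrete with a small sketch. It assumes a toy series of 12 samples whose index stands in for the timestamp (the array and the variable names are illustrative, not from the book) and counts how many shuffled k-fold splits train on samples that are later than the earliest validation sample:

```python
import numpy as np
from sklearn.model_selection import KFold

# 12 time-ordered samples; the index stands in for the timestamp (toy assumption).
timestamps = np.arange(12)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
leaky_folds = 0
for train_index, val_index in kfold.split(timestamps):
    # Look-ahead: at least one training sample is later than the earliest validation sample.
    if train_index.max() > val_index.min():
        leaky_folds += 1
    print("train:", train_index, "val:", val_index)

print("folds with look-ahead:", leaky_folds)
```

Because only one fold can consist of the very latest samples, at least two of the three folds here must train on data from the "future" of their validation set.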

Time-series cross-validation is a cross-validation strategy designed specifically for time-series data. Like k-fold, it accepts a predefined number of folds, k, and generates k validation sets. The difference is that the data is never shuffled, and the training set in each iteration is a superset of the one in the previous iteration, meaning the training set keeps growing as the iterations proceed. Each validation set always comes after its training set in time. Once we finish the cross-validation and settle on the final model configuration, we can then test our final model on the test data (see Figure 1.4):
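The expanding-window idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's or any library's code: the function name `expanding_window_splits` and the choice to give every validation block `n_samples // (n_splits + 1)` samples are assumptions made for the example.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, val_indices) pairs with a growing training set.

    Hypothetical helper: each validation block has n_samples // (n_splits + 1)
    samples, and any leftover samples go to the first training set.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_val_start = n_samples - n_splits * fold_size
    for k in range(n_splits):
        val_start = first_val_start + k * fold_size
        train = list(range(0, val_start))  # everything before the validation block
        val = list(range(val_start, val_start + fold_size))
        yield train, val


splits = list(expanding_window_splits(n_samples=10, n_splits=4))
for train, val in splits:
    print("train:", train, "val:", val)
```

No index is shuffled, each training set contains the previous one, and every validation index is later than every training index in the same split.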

Figure 1.4 – Time-series cross-validation


Also, the Scikit-Learn package provides us with a nice implementation of this strategy:

from sklearn.model_selection import train_test_split, TimeSeriesSplit

# df is assumed to be a pandas DataFrame whose rows are sorted by time
# Hold out the last 20% as the test set; shuffle=False preserves the time order
df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0, shuffle=False)
tscv = TimeSeriesSplit(n_splits=5)
for train_index, val_index in tscv.split(df_cv):
    df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
    # perform training or hyperparameter tuning here

Providing n_splits=5 ensures that five validation sets are generated (the Scikit-Learn documentation calls them test sets). It is worth noting that, by default, the train set has a size of i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1) for the ith fold, while the test set has a size of n_samples // (n_splits + 1), where n_samples is the number of samples passed to split.

However, you can change the train and test set sizes via the max_train_size and test_size arguments of the TimeSeriesSplit class. Additionally, there is a gap argument that excludes G samples from the end of each train set, where G is a value specified by the developer.
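The default split sizes can be checked empirically. The sketch below assumes a toy array of 23 time-ordered samples (an arbitrary size, chosen so that the number of samples is not a multiple of n_splits + 1) and simply records the size of each train and validation set:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 23 time-ordered samples (arbitrary toy size, not from the book).
X = np.arange(23).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
sizes = [(len(train_index), len(val_index)) for train_index, val_index in tscv.split(X)]

for i, (n_train, n_val) in enumerate(sizes, start=1):
    print(f"split {i}: train size = {n_train}, validation size = {n_val}")
```

With these settings, every validation set has 23 // 6 = 3 samples, and the train sets grow from 8 to 20 samples in steps of 3.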
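As a rough illustration of these arguments, the sketch below uses a toy array of 20 samples with arbitrarily chosen values for max_train_size, test_size, and gap (the data and the values are assumptions for the example only):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 time-ordered samples (toy data).
X = np.arange(20).reshape(-1, 1)

# Cap each train set at 6 samples, use 2 samples per validation set,
# and drop 1 sample between the end of the train set and the validation set.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=6, test_size=2, gap=1)
splits = [(train_index.tolist(), val_index.tolist()) for train_index, val_index in tscv.split(X)]

for train, val in splits:
    print("train:", train, "val:", val)
```

With max_train_size set, the training window slides forward instead of growing, and the gap leaves one unused sample between each train set and its validation set.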

Be aware that the Scikit-Learn implementation always makes sure that there is no overlap between test sets, which is not strictly necessary. Currently, there is no way to enable overlap between the test sets using the Scikit-Learn implementation. To use that kind of strategy, you need to write the code from scratch.
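One possible from-scratch approach is sketched below in plain Python. The function `overlapping_window_splits` and its parameters (`val_size`, `step`, `min_train_size`) are hypothetical names introduced for this illustration. The validation windows overlap whenever `step` is smaller than `val_size`:

```python
def overlapping_window_splits(n_samples, val_size, step, min_train_size):
    """Yield (train_indices, val_indices) pairs for an expanding training set.

    Hypothetical helper: consecutive validation windows start `step` samples
    apart, so they overlap when step < val_size.
    """
    if step <= 0 or val_size <= 0 or min_train_size <= 0:
        raise ValueError("val_size, step, and min_train_size must be positive")
    val_start = min_train_size
    while val_start + val_size <= n_samples:
        train = list(range(0, val_start))  # all samples before the window
        val = list(range(val_start, val_start + val_size))
        yield train, val
        val_start += step


splits = list(overlapping_window_splits(n_samples=10, val_size=4, step=2, min_train_size=4))
for train, val in splits:
    print("train:", train, "val:", val)
```

Each train set still ends before its validation window begins, so no future data leaks into training even though consecutive validation windows share samples.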

In this section, we learned about the unique characteristic of time-series data and how to perform a cross-validation strategy on it. There are other variations of the cross-validation strategy that haven't been covered in this book. If you are interested, you might find some pointers in the Further reading section.
