Splitting data
After loading data, splitting it is a crucial step. This recipe will explain why we need to split data, as well as how to do it.
Getting ready
Why do we need to split data? An ML model is quite like a student.
You provide a student with many lectures and exercises, with or without the answers. But more often than not, students are evaluated on a completely new problem. To pass, they cannot simply memorize the exercises and their solutions; they must understand the underlying concepts and methods.
An ML model is no different: you train the model on training data and then evaluate it on test data. This way, you make sure the model fully understands the task and generalizes well to new, unseen data.
So, the dataset is usually split into train and test sets:
- The train set must be as large as possible to give as many samples as possible to the model
- The test set must be large enough to be statistically significant in evaluating the model
Typical splits range from 80%/20% for rather small datasets (for example, hundreds of samples) to 99%/1% for very large datasets (for example, millions of samples or more).
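To see why the ratio can shrink as the dataset grows, here is a rough, hypothetical illustration (the dataset sizes are made up, not from the recipe) of the absolute test-set sizes these ratios produce:

```python
# Hypothetical dataset sizes to illustrate the split ratios above
small_n, large_n = 500, 5_000_000

small_test = int(small_n * 0.20)   # 80%/20% split for a small dataset
large_test = int(large_n * 0.01)   # 99%/1% split for a very large dataset

print(small_test)  # 100 test samples
print(large_test)  # 50,000 test samples: 1% is already plenty at this scale
```

Even a 1% test set of a very large dataset contains tens of thousands of samples, which is more than enough for a statistically significant evaluation.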
For this recipe and the others in this chapter, it is assumed that the code has been executed in the same notebook as the previous recipe since each recipe reuses the code from the previous ones.
How to do it…
Here are the steps to try out this recipe:
- You can split the data rather easily with scikit-learn and the `train_test_split()` function:

```python
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['Survived']), df['Survived'],
    test_size=0.2, stratify=df['Survived'],
    random_state=0)
```
This function uses the following parameters as input:

- `X`: All columns but the `'Survived'` label
- `y`: The `'Survived'` label column
- `test_size`: This is `0.2`, which means the test set will hold 20% of the samples and the train set the remaining 80%
- `stratify`: This specifies the `'Survived'` column to ensure the same label balance is used in both splits
- `random_state`: `0`; any fixed integer ensures reproducibility
It returns the following outputs:

- `X_train`: The train split of `X`
- `X_test`: The test split of `X`
- `y_train`: The train split of `y`, associated with `X_train`
- `y_test`: The test split of `y`, associated with `X_test`
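To make the four outputs concrete, here is a minimal, self-contained sketch on a small made-up DataFrame (the columns are hypothetical stand-ins, not the recipe's actual data), checking that the splits have the expected sizes and stay aligned:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in DataFrame: 10 rows, two features, a binary label
df = pd.DataFrame({
    'Age': range(10),
    'Fare': [f * 1.5 for f in range(10)],
    'Survived': [0, 1] * 5,
})

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['Survived']), df['Survived'],
    test_size=0.2, stratify=df['Survived'], random_state=0)

# 8 train rows and 2 test rows, and X/y stay aligned by index
assert X_train.shape == (8, 2) and X_test.shape == (2, 2)
assert (X_train.index == y_train.index).all()
assert (X_test.index == y_test.index).all()
```

Because the splits preserve the original DataFrame indices, each row of `X_train` keeps its matching label in `y_train`, and likewise for the test split.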
Note

The `stratify` option is not mandatory, but it can be critical to ensure a balanced split of any qualitative feature, not just the labels, especially in the case of imbalanced data.
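The effect of `stratify` can be checked directly. In this sketch (with made-up, deliberately imbalanced labels), both splits keep the original 90/10 class ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90% class 0, 10% class 1
y = pd.Series([0] * 90 + [1] * 10)
X = pd.DataFrame({'feature': range(100)})

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Both splits preserve the original 90/10 label balance
print(y_train.value_counts(normalize=True))  # 0: 0.9, 1: 0.1
print(y_test.value_counts(normalize=True))   # 0: 0.9, 1: 0.1
```

Without `stratify`, a random split of such data could easily leave only one or zero positive samples in the test set, making the evaluation unreliable.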
This split should be done as early as possible in the data processing so that you avoid any potential data leakage. From now on, all preprocessing will be computed on the train set, and only then applied to the test set, in agreement with Figure 2.2.
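The train-then-apply pattern can be sketched as follows; `StandardScaler` here stands in for any preprocessing step, and the data is made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the train set only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

# The scaler's mean and std come from X_train alone, so nothing about
# the test set leaks into the preprocessing
assert np.allclose(X_train_scaled.mean(axis=0), 0.0, atol=1e-9)
```

Fitting the scaler on the full dataset instead would let test-set statistics influence the transformation, which is exactly the kind of leakage the early split is meant to prevent.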
See also
See the official documentation for the `train_test_split` function: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.