Preparing quantitative data

How features must be prepared differs depending on the type of data. In this recipe, we'll cover how to prepare quantitative data, including missing data imputation and rescaling.

Getting ready

In the Titanic dataset, as in any other dataset, there may be missing data. There are several ways to deal with it: for example, you can drop a column or a row, or impute a value. Imputation techniques range from the simple to the sophisticated, and scikit-learn supplies several implementations of imputers, such as SimpleImputer and KNNImputer.
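
For comparison, here is a minimal sketch of KNNImputer on a small hypothetical DataFrame (the data and column values here are invented for illustration):

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with one missing 'Age' value
df = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0],
                   'Fare': [7.25, 71.28, 7.92, 53.10]})
# Each missing value is replaced by the mean of its 2 nearest neighbors
knn_imputer = KNNImputer(n_neighbors=2)
df_imputed = knn_imputer.fit_transform(df)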

As we will see in this recipe, we can use SimpleImputer to impute missing quantitative data with the mean value of each feature.

Once the missing data has been handled, we can prepare the quantitative data by rescaling it so that all features are on the same scale.

Several rescaling strategies exist, such as min-max scaling, robust scaling, standard scaling, and others.

In this recipe, we will use standard scaling: for each feature, we subtract the mean value of this feature, and then divide it by the standard deviation of that feature:

x' = (x − μ) / σ

Here, μ and σ are the mean and standard deviation of the feature.

Fortunately, scikit-learn provides a fully working implementation via StandardScaler.
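
As a quick sanity check, the following sketch on toy data verifies that StandardScaler matches the formula above (StandardScaler uses the population standard deviation, as np.std does by default):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
# Standard scaling by hand: subtract the mean, divide by the standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)
print(np.allclose(scaled, manual))  # True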

How to do it…

In this recipe, we will first handle missing values and then rescale the data:

  1. Import the required classes – SimpleImputer for missing data imputation and StandardScaler for rescaling:
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
  2. Select the quantitative features we want to keep. Here, we will keep 'Pclass', 'Age', 'Fare', 'SibSp', and 'Parch' and store these features in new variables for both the train and test sets:
    quanti_columns = ['Pclass', 'Age', 'Fare', 'SibSp', 'Parch']
    # Get the quantitative columns
    X_train_quanti = X_train[quanti_columns]
    X_test_quanti = X_test[quanti_columns]
  3. Instantiate the simple imputer with a mean strategy. Here, the missing value of a feature will be replaced with the mean value of that feature:
    # Impute missing quantitative values with mean feature value
    quanti_imputer = SimpleImputer(strategy='mean')
  4. Fit the imputer on the train set, then use it to transform the test set; this avoids leakage in the imputation:
    # Fit and impute the training set
    X_train_quanti = quanti_imputer.fit_transform(X_train_quanti)
    # Just impute the test set
    X_test_quanti = quanti_imputer.transform(X_test_quanti)
  5. Now that imputation has been performed, instantiate the scaler object:
    # Instantiate the standard scaler
    scaler = StandardScaler()
  6. Finally, fit and apply the standard scaler to the train set, and then apply it to the test set:
    # Fit and transform the training set
    X_train_quanti = scaler.fit_transform(X_train_quanti)
    # Just transform the test set
    X_test_quanti = scaler.transform(X_test_quanti)

We now have quantitative data with no missing values, fully rescaled, with no data leakage.
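
Note that steps 3 to 6 can also be chained into a single scikit-learn Pipeline, which keeps the fit-on-train/transform-on-test logic (and thus the leakage protection) in one object. Here is a minimal sketch, assuming the X_train, X_test, and quanti_columns variables defined above:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Chain mean imputation and standard scaling
quanti_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
# Fit on the train set only, then apply to both sets
X_train_quanti = quanti_pipeline.fit_transform(X_train[quanti_columns])
X_test_quanti = quanti_pipeline.transform(X_test[quanti_columns])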

There’s more…

In this recipe, we used the simple imputer, assuming there was missing data. In practice, it is highly recommended that you look at the data first to check whether there are missing values, as well as how many. It is possible to look at the number of missing values per column with the following code snippet:

# Display the number of missing data for each column
X_train[quanti_columns].isna().sum()

This will output the following:

Pclass      0
Age       146
Fare        0
SibSp       0
Parch       0

Thanks to this, we know that the Age feature has 146 missing values, while the other features have no missing data.
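
If you prefer a relative view, the same check can be expressed as a percentage of missing values per column:

# Share of missing values per column, in percent
(X_train[quanti_columns].isna().mean() * 100).round(1)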

See also

A few imputers are available in scikit-learn. The list is available here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute.

There are many ways to scale data, and you can find the methods that are available in scikit-learn here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing.

You might be interested in looking at this comparison of several scalers on some given data: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py.
