Preparing qualitative data
In this recipe, we will prepare qualitative data, including missing value imputation and encoding.
Getting ready
Qualitative data requires different treatment from quantitative data. Imputing missing values with the mean of a feature would make no sense (and would not even work with non-numeric data): it makes more sense, for example, to use the most frequent value (the mode) of a feature. The SimpleImputer class allows us to do exactly that.
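As a quick illustration, here is a minimal toy sketch (not the recipe's data) of what most_frequent imputation does to a small categorical column:
# Toy sketch: the missing entry is replaced by the column's most frequent value
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
toy = pd.DataFrame({'Embarked': ['S', 'S', np.nan, 'C']})
imputer = SimpleImputer(strategy='most_frequent')
print(imputer.fit_transform(toy))  # the missing entry becomes 'S', the mode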
The same goes for rescaling: it would make no sense to rescale qualitative data. Instead, it is more common to encode it. One of the most typical techniques is called one-hot encoding.
The idea is to transform each category, out of N possible categories, into a vector holding a single 1 and N-1 zeros. In our example, the Embarked feature’s one-hot encoding would be as follows:
- ‘C’ = [1, 0, 0]
- ‘Q’ = [0, 1, 0]
- ‘S’ = [0, 0, 1]
Note
Having N columns for N categories is not necessarily optimal. What happens if, in the preceding example, we remove the first column? If the value is neither ‘Q’ = [1, 0] nor ‘S’ = [0, 1], then it must be ‘C’ = [0, 0]. There is no need for an extra column to keep all the necessary information. This generalizes: N categories only require N-1 columns to carry all the information, which is why one-hot encoding functions usually allow you to drop a column.
scikit-learn’s OneHotEncoder class allows us to do this. It also allows us to deal with unknown categories that may appear in the test set (or in production) through several strategies, such as raising an error, ignoring them, or grouping them as an infrequent category. Finally, it allows us to drop the first column after encoding.
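As a quick sketch (not the recipe's code, and assuming scikit-learn 1.0 or later, since older releases do not allow combining drop with handle_unknown='ignore'), here is how the Embarked example behaves once the first category is dropped and an unseen value shows up:
# Toy sketch: drop='first' keeps N-1 columns; unseen values are encoded as all zeros
import numpy as np
from sklearn.preprocessing import OneHotEncoder
embarked = np.array([['C'], ['Q'], ['S']])
enc = OneHotEncoder(drop='first', handle_unknown='ignore')
print(enc.fit_transform(embarked).toarray())
# [[0. 0.]   -> 'C', the dropped first category
#  [1. 0.]   -> 'Q'
#  [0. 1.]]  -> 'S'
print(enc.transform(np.array([['X']])).toarray())  # unseen 'X' -> [[0. 0.]]
Note that with drop enabled, an unseen category ends up encoded exactly like the dropped one, which is something to keep in mind.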
How to do it…
Just like in the preceding recipe, we will handle any missing data; the features will then be one-hot encoded:
- Import the necessary classes – SimpleImputer for missing data imputation (already imported in the previous recipe) and OneHotEncoder for encoding. We also need to import numpy so that we can concatenate the prepared qualitative and quantitative data at the end of this recipe:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
- Select the qualitative features we want to keep: 'Sex' and 'Embarked'. Then, store these features in new variables for both the train and test sets:
quali_columns = ['Sex', 'Embarked']
# Get the qualitative columns
X_train_quali = X_train[quali_columns]
X_test_quali = X_test[quali_columns]
- Instantiate SimpleImputer with the most_frequent strategy. Any missing values will be replaced by the most frequent value of the feature:
# Impute missing qualitative values with the most frequent feature value
quali_imputer = SimpleImputer(strategy='most_frequent')
- Fit and transform the imputer on the train set, and then transform the test set:
# Fit and impute the training set
X_train_quali = quali_imputer.fit_transform(X_train_quali)
# Just impute the test set
X_test_quali = quali_imputer.transform(X_test_quali)
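If you want to double-check what was learned (an optional check, not part of the original recipe), the fitted imputer exposes the most frequent value of each column in its statistics_ attribute:
# Optional check: the per-column modes learned by the imputer
print(dict(zip(quali_columns, quali_imputer.statistics_)))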
- Instantiate the encoder. Here, we will specify the following parameters:
  - drop='first': This will drop the first category's column for each feature
  - handle_unknown='ignore': If a new value appears in the test set (or in production), it will be encoded as all zeros:
# Instantiate the encoder
encoder = OneHotEncoder(drop='first', handle_unknown='ignore')
- Fit and transform the encoder on the training set, and then transform the test set using this encoder:
# Fit and transform the training set
X_train_quali = encoder.fit_transform(X_train_quali).toarray()
# Just encode the test set
X_test_quali = encoder.transform(X_test_quali).toarray()
Note
We need to call .toarray() on the encoder's output because OneHotEncoder returns a sparse matrix by default, which cannot be concatenated in that form with the other features.
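Optionally (and assuming scikit-learn 1.0 or later), you can inspect the names of the encoded columns, which can help when debugging; this check is not part of the original recipe:
# Optional check: names of the encoded columns, with each feature's first category dropped
print(encoder.get_feature_names_out(quali_columns))
# e.g. ['Sex_male' 'Embarked_Q' 'Embarked_S'] for the Titanic data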
- With that, all the data has been prepared – both quantitative and qualitative (considering this recipe and the previous one). It is now possible to concatenate this data before training a model:
# Concatenate the data back together
X_train = np.concatenate([X_train_quanti, X_train_quali], axis=1)
X_test = np.concatenate([X_test_quanti, X_test_quali], axis=1)
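As a quick sanity check (optional, not part of the original recipe), the concatenated arrays should keep the original number of rows and have the same number of columns in the train and test sets:
# Optional sanity check on the final shapes
print(X_train.shape, X_test.shape)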
There’s more…
It is possible to save the prepared data as a pickle file, either to share it or simply to avoid having to prepare it again. The following code will allow us to do this:
import pickle
pickle.dump((X_train, X_test, y_train, y_test), open('prepared_titanic.pkl', 'wb'))
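Loading the data back later is just as short:
# Load the prepared data back from the pickle file
import pickle
X_train, X_test, y_train, y_test = pickle.load(open('prepared_titanic.pkl', 'rb'))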
We now have fully prepared data that can be used to train ML models.
Note
Several steps have been omitted or simplified here for clarity. Data may need more preparation, such as more thorough missing value imputation, outlier and duplicate detection (and perhaps removal), feature engineering, and so on. It is assumed that you already have some familiarity with those aspects; you are encouraged to read other material on this topic if required.
See also
The scikit-learn documentation about missing value imputation is worth looking at: https://scikit-learn.org/stable/modules/impute.html.
Finally, the more general documentation about data preprocessing can be very useful: https://scikit-learn.org/stable/modules/preprocessing.html.