The Regularization Cookbook

Explore practical recipes to improve the functionality of your ML models

Product type: Paperback
Published in: Jul 2023
Publisher: Packt
ISBN-13: 9781837634088
Length: 424 pages
Edition: 1st Edition

Author: Vincent Vandenbussche
Table of Contents (14 chapters)

Preface
Chapter 1: An Overview of Regularization
Chapter 2: Machine Learning Refresher
Chapter 3: Regularization with Linear Models
Chapter 4: Regularization with Tree-Based Models
Chapter 5: Regularization with Data
Chapter 6: Deep Learning Reminders
Chapter 7: Deep Learning Regularization
Chapter 8: Regularization with Recurrent Neural Networks
Chapter 9: Advanced Regularization in Natural Language Processing
Chapter 10: Regularization in Computer Vision
Chapter 11: Regularization in Computer Vision – Synthetic Image Generation
Index
Other Books You May Enjoy

Preparing qualitative data

In this recipe, we will prepare qualitative data, including missing value imputation and encoding.

Getting ready

Qualitative data requires different treatment from quantitative data. Imputing missing values with the mean of a feature would make no sense (and would not work with non-numeric data); it makes more sense, for example, to use the most frequent value of a feature (that is, its mode). The SimpleImputer class allows us to do this.
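For illustration, here is a minimal sketch of most_frequent imputation on an invented toy column (not part of the recipe's data):

import numpy as np
from sklearn.impute import SimpleImputer

# Toy qualitative column with one missing value (invented for illustration)
toy = np.array([['S'], ['S'], ['C'], [np.nan]], dtype=object)
imputer = SimpleImputer(strategy='most_frequent')
print(imputer.fit_transform(toy))
# [['S'] ['S'] ['C'] ['S']], the missing value is replaced by the mode, 'S'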

The same goes for rescaling: it would make no sense to rescale qualitative data. Instead, it is more common to encode it. One of the most typical techniques is called one-hot encoding.

The idea is to transform each of the N possible categories into a vector holding a single 1 and N-1 zeros. In our example, the Embarked feature’s one-hot encoding would be as follows:

  • ‘C’ = [1, 0, 0]
  • ‘Q’ = [0, 1, 0]
  • ‘S’ = [0, 0, 1]
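As a minimal sketch on these three toy values (the sparse_output parameter assumes scikit-learn 1.2 or later; older versions use sparse=False instead), OneHotEncoder reproduces exactly these vectors:

from sklearn.preprocessing import OneHotEncoder

embarked = [['C'], ['Q'], ['S']]
encoder = OneHotEncoder(sparse_output=False)
print(encoder.fit_transform(embarked))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]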

Note

Having N columns for N categories is not necessarily optimal. What happens if, in the preceding example, we remove the first column? If the value is neither ‘Q’ = [1, 0] nor ‘S’ = [0, 1], then it must be ‘C’ = [0, 0]. There is no need for an extra column to carry all the necessary information. This generalizes: N categories only require N-1 columns to hold all the information, which is why one-hot encoding functions usually allow you to drop a column.

scikit-learn’s OneHotEncoder class allows us to do this. It also allows us to deal with unknown categories that may appear in the test set (or the production environment) with several strategies, such as raising an error, ignoring them (encoding them as all zeros), or mapping them to an infrequent category. Finally, it allows us to drop the first column after encoding.
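Here is a minimal sketch of both options on the same toy values (combining drop='first' with handle_unknown='ignore' assumes scikit-learn 1.0 or later; sparse_output assumes 1.2 or later). Note that an unknown category becomes all zeros and is therefore indistinguishable from the dropped first category:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(drop='first', handle_unknown='ignore',
    sparse_output=False)
encoder.fit([['C'], ['Q'], ['S']])
print(encoder.transform([['C'], ['Q'], ['S'], ['X']]))
# [[0. 0.]   'C' (the dropped first category encodes as all zeros)
#  [1. 0.]   'Q'
#  [0. 1.]   'S'
#  [0. 0.]]  'X' (unknown, also all zeros)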

How to do it…

Just like in the preceding recipe, we will handle any missing data, and then the features will be one-hot encoded:

  1. Import the necessary classes – SimpleImputer for missing data imputation (already imported in the previous recipe) and OneHotEncoder for encoding. We also need to import numpy so that we can concatenate the qualitative and quantitative data that’s been prepared at the end of this recipe:
    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
  2. Select the qualitative features we want to keep: 'Sex' and 'Embarked'. Then, store these features in new variables for both the train and test sets:
    quali_columns = ['Sex', 'Embarked']
    # Get the qualitative columns
    X_train_quali = X_train[quali_columns]
    X_test_quali = X_test[quali_columns]
  3. Instantiate SimpleImputer with the most_frequent strategy. Any missing values will be replaced by the most frequent value of that feature:
    # Impute missing qualitative values with most frequent feature value
    quali_imputer = SimpleImputer(strategy='most_frequent')
  4. Fit and transform the imputer on the train set, and then transform the test set:
    # Fit and impute the training set
    X_train_quali = quali_imputer.fit_transform(X_train_quali)
    # Just impute the test set
    X_test_quali = quali_imputer.transform(X_test_quali)
  5. Instantiate the encoder. Here, we will specify the following parameters:
    • drop='first': This will drop the first column of each feature's encoding
    • handle_unknown='ignore': If a new value appears in the test set (or in production), it will be encoded as zeros:
      # Instantiate the encoder
      encoder = OneHotEncoder(drop='first', handle_unknown='ignore')
  6. Fit and transform the encoder on the training set, and then transform the test set using this encoder:
    # Fit and transform the training set
    X_train_quali = encoder.fit_transform(X_train_quali).toarray()
    # Just encode the test set
    X_test_quali = encoder.transform(X_test_quali).toarray()

Note

We need to call .toarray() on the encoder’s output because it is a sparse matrix by default and cannot be concatenated with the other features in that form.
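Alternatively, a sketch assuming scikit-learn 1.2 or later (where the parameter is named sparse_output; it was previously named sparse): the encoder can be asked for dense arrays directly, making .toarray() unnecessary:

from sklearn.preprocessing import OneHotEncoder

# Ask the encoder for dense arrays directly
encoder = OneHotEncoder(drop='first', handle_unknown='ignore',
    sparse_output=False)
X_train_quali = encoder.fit_transform(X_train_quali)  # already dense
X_test_quali = encoder.transform(X_test_quali)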

  7. With that, all the data has been prepared – both quantitative and qualitative (considering this recipe and the previous one). It is now possible to concatenate this data before training a model:
    # Concatenate the data back together
    X_train = np.concatenate([X_train_quanti, X_train_quali], axis=1)
    X_test = np.concatenate([X_test_quanti, X_test_quali], axis=1)
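Since NumPy concatenation loses the column names, the encoded feature names can be recovered from the fitted encoder for inspection (get_feature_names_out assumes scikit-learn 1.0 or later; the names shown are what the Titanic columns would typically produce):

# Optional: list the names of the encoded qualitative columns
print(encoder.get_feature_names_out(quali_columns))
# With drop='first', something like ['Sex_male' 'Embarked_Q' 'Embarked_S']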

There’s more…

It is possible to save the prepared data as a pickle file, either to share it or to avoid having to prepare it again. The following code will allow us to do this:

import pickle
pickle.dump((X_train, X_test, y_train, y_test),
    open('prepared_titanic.pkl', 'wb'))
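
Conversely, the prepared data can be reloaded later instead of re-running the preparation:

import pickle

# Reload the prepared data from the pickle file
with open('prepared_titanic.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)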

We now have fully prepared data that can be used to train ML models.

Note

Several steps have been omitted or simplified here for clarity. Data may need more preparation, such as more thorough missing value imputation, outlier and duplicate detection (and perhaps removal), feature engineering, and so on. It is assumed that you already have some familiarity with those aspects; you are encouraged to read other materials about this topic if required.

See also

This more general documentation about missing data imputation is worth looking at: https://scikit-learn.org/stable/modules/impute.html.

Likewise, this more general documentation about data preprocessing can be very useful: https://scikit-learn.org/stable/modules/preprocessing.html.

You have been reading a chapter from The Regularization Cookbook (Packt, July 2023, ISBN-13: 9781837634088).