The Regularization Cookbook

Explore practical recipes to improve the functionality of your ML models

Product type Paperback
Published in Jul 2023
Publisher Packt
ISBN-13 9781837634088
Length 424 pages
Edition 1st Edition
Author (1):
Vincent Vandenbussche

Loading data

The primary focus of this recipe is to load data from a CSV file. However, this is not the only thing that this recipe covers. Since loading data is usually the first step in any ML project, this recipe is also a good opportunity to give a quick recap of the ML workflow, as well as the different types of data.

Getting ready

Before loading the data, we should keep in mind that an ML model follows a two-step process:

  1. Train a model on a given dataset to produce a trained model.
  2. Reuse the previously trained model to infer predictions on new data.

These two steps are summarized in the following figure:

Figure 2.1 – A simple view of the two-step ML process

Of course, in most cases, this is a rather simplistic view. A more detailed view can be seen in Figure 2.2:

Figure 2.2 – A more complete view of the ML process

Let’s take a closer look at the training part of the ML process shown in Figure 2.2:

  1. First, training data is queried from a data source (this can be a database, a data lake, an open dataset, and so on).
  2. The data is preprocessed, for example through feature engineering or rescaling.
  3. A model is trained and stored (on a data lake, locally, on the edge, and so on).
  4. Optionally, the output of this model is post-processed – for example, via formatting, heuristics, business rules, and more.
  5. Optionally again, the output of this model (with or without post-processing) is stored in a database for later reference or evaluation if needed.

Now, let’s take a look at the inference part of the ML process:

  1. The data is queried from a data source (a database, an API query, and so on).
  2. The data goes through the same preprocessing step as the training data.
  3. The trained model is fetched if it doesn’t already exist locally.
  4. The model is used to infer output.
  5. Optionally, the output of the model is post-processed via the same post-processing step that was used during training.
  6. Optionally, the output is stored in a database for monitoring and later reference.
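
The training and inference flows above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production pipeline: the function names, the toy "model" (a single threshold), the made-up values, and the use of a JSON file as model storage are all assumptions made for this example. What it tries to show is that inference reuses the same preprocessing function as training and fetches the previously stored model.

```python
import json
import os
import tempfile


def preprocess(rows):
    """Shared preprocessing step (hypothetical): rescale each raw value."""
    return [value / 100.0 for value in rows]


def train(features, labels):
    """Toy 'training': use the midpoint between the two class means as a threshold."""
    positives = [x for x, y in zip(features, labels) if y == 1]
    negatives = [x for x, y in zip(features, labels) if y == 0]
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    return {"threshold": (mean_pos + mean_neg) / 2}


def predict(model, features):
    """Inference: apply the stored threshold to already preprocessed features."""
    return [1 if x >= model["threshold"] else 0 for x in features]


# --- Training: query data, preprocess it, train, then store the model ---
train_rows = [10, 20, 80, 90]   # made-up raw values
train_labels = [0, 0, 1, 1]     # made-up binary labels
model = train(preprocess(train_rows), train_labels)

model_path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(model_path, "w") as f:
    json.dump(model, f)

# --- Inference: query new data, reuse the SAME preprocessing, fetch the model ---
new_rows = [15, 85]
with open(model_path) as f:
    loaded_model = json.load(f)
predictions = predict(loaded_model, preprocess(new_rows))
print(predictions)
```

Keeping preprocessing in a single function that both flows call is one simple way to ensure that new data goes through exactly the same transformation as the training data.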

Even this schema leaves out many steps: splitting data for training purposes, using evaluation metrics, cross-validation, hyperparameter optimization, and others. This chapter will dive into the more training-specific steps and apply them to the widely used Titanic dataset, a binary classification problem. But first, we need to load the data.

To do so, you must download the Titanic dataset training set locally. This can be performed with the following command line:

wget https://github.com/PacktPublishing/The-Regularization-Cookbook/raw/main/chapter_02/train.csv
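
After downloading, it can be worth confirming that the file really contains CSV data and not an HTML page, which is what some web URLs return instead of the raw file. The following is a minimal sketch of such a check; the helper name looks_like_csv and its simple heuristic are assumptions made for illustration and are not part of the recipe.

```python
def looks_like_csv(path, expected_columns=("PassengerId", "Survived")):
    """Heuristic check: the first line should be a comma-separated header
    containing the expected column names, not an HTML tag."""
    with open(path, encoding="utf-8") as f:
        first_line = f.readline().strip()
    if first_line.startswith("<"):
        return False  # Probably an HTML page rather than raw CSV
    header = first_line.split(",")
    return all(column in header for column in expected_columns)
```

Calling looks_like_csv('train.csv') before loading the file gives an early warning if the download did not produce the expected content.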

How to do it…

This recipe is about loading a CSV file and displaying a few rows of data so that we can have a first glance at what it contains:

  1. The first step is to import the required libraries. Here, the only library we need is pandas:
    import pandas as pd
  2. Now, we can load the data using the read_csv function provided by pandas. The first argument is the path to the file. Assuming the file is named train.csv and located in the current folder, we only have to provide train.csv as an argument:
    # Load the data as a DataFrame
    df = pd.read_csv('train.csv')

The returned object is a pandas DataFrame, which provides many useful methods for data processing.

  3. Now, we can display the first five rows of the loaded file using the .head() method:
    # Display the first 5 rows of the dataset
    df.head()

This code will output the following:

   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S

Here is a description of the data types in each column:

  • PassengerId (qualitative): A unique, arbitrary ID for each passenger.
  • Survived (qualitative): 1 for yes, 0 for no. This is our label, so this is a binary classification problem.
  • Pclass (quantitative, discrete): The ticket class (1, 2, or 3). It is arguably quantitative because the classes are ordered: class 1 is most likely better than class 2.
  • Name (unstructured): The name and title of the passenger.
  • Sex (qualitative): The registered sex of the passenger, either male or female.
  • Age (quantitative, discrete): The age of the passenger.
  • SibSp (quantitative, discrete): The number of siblings and spouses on board.
  • Parch (quantitative, discrete): The number of parents and children on board.
  • Ticket (unstructured): The ticket reference.
  • Fare (quantitative, continuous): The ticket price.
  • Cabin (unstructured): The cabin number, which is arguably unstructured. It can also be seen as a qualitative feature with high cardinality.
  • Embarked (qualitative): The port of embarkation, either Southampton (S), Cherbourg (C), or Queenstown (Q).
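
One way to cross-check this description against the data is to look at the types pandas inferred and at the number of distinct values per column. The sketch below rebuilds a few of the columns from the five rows displayed above so that it is self-contained (the name sample is an assumption for this example); with the real file, the same calls could be applied to the df loaded earlier.

```python
import pandas as pd

# A subset of the columns from the five rows displayed by df.head()
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [None, "C85", None, "C123", None],  # None marks a missing value
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Storage type inferred for each column
print(sample.dtypes)

# Number of distinct values: a low count hints at a qualitative or discrete feature
print(sample.nunique())

# Number of missing values per column
print(sample.isna().sum())
```

Note that the storage type alone does not settle the question: Survived and Pclass are both stored as integers, yet Survived is a qualitative label.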

There’s more…

Let’s talk about the different types of data that are available. Data is a very generic word and can describe many things. We are surrounded by data all the time. One way to characterize data is through pairs of opposites.

Data can be structured or unstructured:

  • Structured data comes in the form of tables, databases, Excel files, CSV files, and JSON files.
  • Unstructured data does not fit in a table: it can be text, sound, images, videos, and so on. Even though we often try to give it a tabular representation, this kind of data does not naturally fit in an Excel table.

Data can be quantitative or qualitative.

Quantitative data is ordered. Here are some examples:

  • €100 is greater than €10
  • 1.8 meters is taller than 1.6 meters
  • 18 years old is younger than 80 years old

Qualitative data has no intrinsic order, as shown here:

  • Blue is not intrinsically better than red
  • A dog is not intrinsically greater than a cat
  • A kitchen is not intrinsically more useful than a bathroom

These are not mutually exclusive. An object can have both quantitative and qualitative features, as can be seen in the case of the car in the following figure:

Figure 2.3 – A single object depicted by both quantitative (left) and qualitative (right) features

Finally, data can be continuous or discrete.

Some data is continuous, as follows:

  • A weight
  • A volume
  • A price

On the other hand, some data is discrete:

  • A color
  • A football score
  • A nationality

Note

Discrete != qualitative.

For example, a football score is discrete, but there is an intrinsic order: 3 points is more than 2.
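
In pandas, this distinction can be expressed with categorical data. The sketch below is an illustration with made-up values, not part of the recipe: it builds an ordered categorical for a discrete feature that has an intrinsic order (ticket classes) and an unordered one for a discrete but qualitative feature (embarkation ports).

```python
import pandas as pd

# Discrete AND ordered: ticket classes, ranked from third (lowest) to first (highest)
pclass = pd.Series(pd.Categorical([3, 1, 3, 2], categories=[3, 2, 1], ordered=True))

# Discrete but qualitative: embarkation ports have no intrinsic order
embarked = pd.Series(pd.Categorical(["S", "C", "Q", "S"], ordered=False))

# Order-based operations are allowed on the ordered categorical
best_class = pclass.max()
print(best_class)

# ...but they raise a TypeError on the unordered one
try:
    embarked.max()
    order_is_defined = True
except TypeError:
    order_is_defined = False
print(order_is_defined)
```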

See also

The pandas read_csv function has a lot of flexibility as it can use other separators, handle headers, and much more. This is described in the official documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html.
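
As a small illustration of that flexibility, the following sketch reads semicolon-separated text that has no header row. The content is made up for this example and is read from an in-memory buffer, so no file is needed; the variable names are assumptions.

```python
import io

import pandas as pd

# Made-up, semicolon-separated content with no header row
raw_text = "1;0;3\n2;1;1\n3;1;3\n"

df_example = pd.read_csv(
    io.StringIO(raw_text),  # read_csv also accepts file-like objects
    sep=";",                # use a non-default separator
    header=None,            # the data has no header row
    names=["PassengerId", "Survived", "Pclass"],  # provide the column names
)
print(df_example)
```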

The pandas library supports I/O operations for many different input formats. For more information, have a look at the official documentation: https://pandas.pydata.org/docs/reference/io.html.
