You're reading from Applied Supervised Learning with Python: Use scikit-learn to build predictive models from real-world datasets and prepare yourself for the future of machine learning

Product type Paperback

Published in Apr 2019

Publisher

ISBN-13 9781789954920

Length 404 pages

Edition 1st Edition

Languages

Python

Tools

Scikit-learn

Concepts

Machine Learning

Authors (2):

Ishita Mathur

Benjamin Johnston

Chapter 1: Python Machine Learning Toolkit

Activity 1: pandas Functions

Solution

Open a new Jupyter notebook.
Use pandas to load the Titanic dataset:
```
import pandas as pd

df = pd.read_csv('titanic.csv')
```
Use the head() function on the dataset as follows:
```
# Have a look at the first 5 samples of the data
df.head()
```
The output will be as follows:
Figure 1.65: First five rows
Use the describe function as follows:
```
df.describe(include='all')
```
The output will be as follows:
Figure 1.66: Output of describe()
We don't need the Unnamed: 0 column. We can remove the column without using the del command, as follows:
```
df = df[df.columns[1:]] # Keep every column except the first (Unnamed: 0)
df.head()
```
The output will be as follows:
Figure 1.67: First five rows after deleting the Unnamed: 0 column
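An Unnamed: 0 column typically appears when a DataFrame was saved to CSV together with its index. As a minimal sketch, assuming a small made-up CSV (the rows below are hypothetical and are not Titanic data), the column can also be dropped by name or absorbed as the index while loading:

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the empty first header mimics a saved index.
csv_text = """,Name,Pclass,Age
0,Passenger A,1,29.0
1,Passenger B,3,41.0
2,Passenger C,2,
"""

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))  # the empty header is read as 'Unnamed: 0'

# Option 1: drop the column by name rather than by position.
by_name = raw.drop(columns=['Unnamed: 0'])

# Option 2: treat the first column as the index while loading.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(by_name.columns))
print(list(as_index.columns))
```

Dropping by name is less fragile than slicing by position if the column order ever changes.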

Compute the mean, standard deviation, minimum, and maximum values for the columns of the DataFrame without using describe:

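```
# numeric_only=True restricts the result to numeric columns;
# pandas 2.0+ raises a TypeError on text columns without it
df.mean(numeric_only=True)

Fare        33.295479
Pclass       2.294882
Age         29.881138
Parch        0.385027
SibSp        0.498854
Survived     0.383838
dtype: float64
```
```
df.std(numeric_only=True)

Fare        51.758668
Pclass       0.837836
Age         14.413493
Parch        0.865560
SibSp        1.041658
Survived     0.486592
dtype: float64
```
```
df.min(numeric_only=True)

Fare        0.00
Pclass      1.00
Age         0.17
Parch       0.00
SibSp       0.00
Survived    0.00
dtype: float64
```
```
df.max(numeric_only=True)

Fare        512.3292
Pclass        3.0000
Age          80.0000
Parch         9.0000
SibSp         8.0000
Survived      1.0000
dtype: float64
```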
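The four statistics can also be gathered in a single call. The following is a minimal sketch on a hypothetical toy DataFrame (the names and values are made up), using select_dtypes to keep only the numeric columns before calling agg:

```python
import pandas as pd

# Hypothetical toy data; Name is a text column, Age has one missing value.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Fare': [10.0, 20.0, 30.0, 40.0],
    'Age': [20.0, 30.0, None, 40.0],
})

# Keep numeric columns only, then compute all four statistics at once.
numeric = toy.select_dtypes(include='number')
summary = numeric.agg(['mean', 'std', 'min', 'max'])
print(summary)
```

Each statistic becomes a row of the result, and missing values are skipped, so the Age mean is taken over three values.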

What about the 33rd, 66th, and 99th percentiles? Use the quantile method as follows:

```
df.quantile(0.33, numeric_only=True)

Fare         8.559325
Pclass       2.000000
Age         23.000000
Parch        0.000000
SibSp        0.000000
Survived     0.000000
Name: 0.33, dtype: float64
```
```
df.quantile(0.66, numeric_only=True)

Fare        26.0
Pclass       3.0
Age         34.0
Parch        0.0
SibSp        0.0
Survived     1.0
Name: 0.66, dtype: float64
```
```
df.quantile(0.99, numeric_only=True)

Fare        262.375
Pclass        3.000
Age          65.000
Parch         4.000
SibSp         5.000
Survived      1.000
Name: 0.99, dtype: float64
```
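The quantile method also accepts a list, which returns all requested quantiles in one DataFrame. As a rough sketch on a hypothetical five-value column, the example below also shows the default linear interpolation between neighbouring values:

```python
import pandas as pd

# Hypothetical, evenly spaced fares so the interpolation is easy to follow.
toy = pd.DataFrame({'Fare': [0.0, 10.0, 20.0, 30.0, 40.0]})

# One call, three quantiles; the result is indexed by the quantile level.
q = toy.quantile([0.33, 0.66, 0.99])
print(q)

# Default interpolation is linear: position = level * (n - 1).
# For 0.33 with n = 5: position 1.32, so 10 + 0.32 * (20 - 10) = 13.2
manual_33 = 10.0 + 0.32 * (20.0 - 10.0)
print(manual_33)
```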

How many passengers were from each class? Let's see, using the groupby method:

```
class_groups = df.groupby('Pclass')
# Iterating a groupby yields (group key, sub-DataFrame) pairs
for name, group in class_groups:
    print(f'Class: {name}: {len(group)}')

Class: 1: 323
Class: 2: 277
Class: 3: 709
```
Answer the same question again, this time using selecting/indexing methods to count the members of each class:
```
for clsGrp in df.Pclass.unique():
    num_class = len(df[df.Pclass == clsGrp])
    print(f'Class {clsGrp}: {num_class}')

Class 3: 709
Class 1: 323
Class 2: 277
```
The counts from the groupby approach and from the selecting/indexing approach match.
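For comparison, the same per-class counts can be obtained without an explicit loop. This is a minimal sketch on a hypothetical six-row Pclass column, not the Titanic data:

```python
import pandas as pd

# Hypothetical class labels for six passengers.
toy = pd.DataFrame({'Pclass': [3, 1, 3, 2, 3, 1]})

# value_counts tallies each distinct value; sort_index orders by class.
counts = toy['Pclass'].value_counts().sort_index()

# groupby(...).size() gives the same tallies, already ordered by class.
sizes = toy.groupby('Pclass').size()

print(counts)
print(sizes)
```

Both calls return a Series indexed by class, so they can be compared directly with the loop-based results.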
Determine who the eldest passenger in third class was:
```
third_class = df.loc[(df.Pclass == 3)]

third_class.loc[(third_class.Age == third_class.Age.max())]
```
The output will be as follows:
Figure 1.68: Eldest passenger in third class
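An alternative way to locate the eldest passenger is idxmax, which returns the index label of the largest value. The sketch below uses a hypothetical four-row DataFrame; note that idxmax returns only the first match, whereas the boolean comparison against max() keeps every passenger who shares the maximum age:

```python
import pandas as pd

# Hypothetical passengers; one third-class age is missing.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Pclass': [3, 1, 3, 3],
    'Age': [22.0, 71.0, 60.5, None],
})

third_class = toy[toy['Pclass'] == 3]

# idxmax skips missing ages and returns the label of the oldest passenger.
eldest_label = third_class['Age'].idxmax()
eldest = toy.loc[eldest_label]
print(eldest)
```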
For a number of machine learning problems, it is very common to scale the numerical values between 0 and 1. Use the agg method with lambda functions to scale the Fare and Age columns between 0 and 1 by dividing each column by its maximum value:
```
fare_max = df.Fare.max()
age_max = df.Age.max()

df.agg({
    'Fare': lambda x: x / fare_max, 
    'Age': lambda x: x / age_max,
}).head()
```
The output will be as follows:
Figure 1.69: Scaling numerical values between 0 and 1
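Dividing by the maximum maps the largest value to 1, but the smallest value only reaches 0 if the column minimum is 0. As a sketch on hypothetical values, the following contrasts that approach with min-max scaling, (x - min) / (max - min), which maps the smallest value to exactly 0:

```python
import pandas as pd

# Hypothetical values; Age does not start at 0, Fare does.
toy = pd.DataFrame({
    'Fare': [0.0, 25.0, 100.0],
    'Age': [10.0, 30.0, 50.0],
})
cols = ['Fare', 'Age']

# Divide-by-max: column-wise division, no lambda needed.
divided = toy[cols] / toy[cols].max()

# Min-max scaling: subtract the minimum, then divide by the range.
col_min = toy[cols].min()
col_max = toy[cols].max()
min_max = (toy[cols] - col_min) / (col_max - col_min)

print(divided)
print(min_max)
```

For Fare the two results agree because its minimum is 0; for Age the divide-by-max result starts at 0.2 rather than 0.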

There is one individual in the dataset without a listed Fare value:

```
df_nan_fare = df.loc[(df.Fare.isna())]
df_nan_fare
```

This is the output:

Figure 1.70: Individual without a listed Fare value

Replace the NaN Fare value of this row in the main DataFrame with the mean Fare of the passengers who share the same class and Embarked location, using the groupby method:

```
embarked_class_groups = df.groupby(['Embarked', 'Pclass'])

# groups maps each (Embarked, Pclass) key to the index labels of its rows
indices = embarked_class_groups.groups[(df_nan_fare.Embarked.values[0], df_nan_fare.Pclass.values[0])]
# The values are labels, so select with loc; mean() skips the NaN itself
mean_fare = df.loc[indices].Fare.mean()
df.loc[(df.index == 1043), 'Fare'] = mean_fare
df.iloc[1043]
```

The output will be as follows:

```
Cabin                      NaN
Embarked                     S
Fare                   14.4354
Pclass                       3
Ticket                    3701
Age                       60.5
Name        Storey, Mr. Thomas
Parch                        0
Sex                       male
SibSp                        0
Survived                   NaN
Name: 1043, dtype: object
```
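The same group-mean imputation can be written without looking up a specific row label. The following is a minimal sketch on a hypothetical five-row DataFrame, using groupby with transform so that every missing Fare is filled with the mean of its own (Embarked, Pclass) group:

```python
import pandas as pd

# Hypothetical passengers; one Fare is missing in the (S, 3) group.
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'C', 'C'],
    'Pclass': [3, 3, 3, 1, 1],
    'Fare': [8.0, 12.0, None, 80.0, 100.0],
})

# transform('mean') returns a Series aligned with toy, holding each row's
# group mean (computed while skipping missing values).
group_mean = toy.groupby(['Embarked', 'Pclass'])['Fare'].transform('mean')

# Fill only the missing entries; existing fares are left untouched.
filled = toy.assign(Fare=toy['Fare'].fillna(group_mean))
print(filled)
```

This variant handles any number of missing fares in one pass, whereas the step above targets the single row at index 1043.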