Chapter 1: Python Machine Learning Toolkit
Activity 1: pandas Functions
Solution
Open a new Jupyter notebook.
Use pandas to load the Titanic dataset:
import pandas as pd
df = pd.read_csv('titanic.csv')
Use the head() function on the dataset as follows:
# Have a look at the first five samples of the data
df.head()
The output will be as follows:
Figure 1.65: First five rows
Use the describe() method as follows:
df.describe(include='all')
The output will be as follows:
Figure 1.66: Output of describe()
We don't need the Unnamed: 0 column. We can remove the column without using the del command, as follows:
df = df[df.columns[1:]]  # Keep every column except the first
df.head()
The output will be as follows:
Figure 1.67: First five rows after deleting the Unnamed: 0 column
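As a minimal sketch of two other ways to get rid of the stray column, the example below uses a small in-memory CSV as a hypothetical stand-in for titanic.csv (the rows and values are made up for illustration). It shows dropping the column by name with drop, and avoiding it altogether by passing index_col=0 to read_csv:

```python
import io

import pandas as pd

# Hypothetical stand-in for titanic.csv: the empty first header cell
# is what pandas reads as the 'Unnamed: 0' column.
csv_text = (
    ",Name,Pclass,Fare\n"
    "0,A,1,80.0\n"
    "1,B,3,7.25\n"
    "2,C,2,13.0\n"
)

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Option 1: drop the column by name
dropped = raw.drop(columns=['Unnamed: 0'])

# Option 2: treat the first column as the index when loading
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(dropped.columns.tolist())
print(indexed.columns.tolist())
```

Dropping by name is less fragile than slicing by position, because it does not depend on where the column sits in the DataFrame.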
Compute the mean, standard deviation, minimum, and maximum values for the columns of the DataFrame without using describe:
# numeric_only=True restricts each statistic to the numeric columns;
# pandas 2.0 and later raise a TypeError on the text columns without it
df.mean(numeric_only=True)

Fare        33.295479
Pclass       2.294882
Age         29.881138
Parch        0.385027
SibSp        0.498854
Survived     0.383838
dtype: float64

df.std(numeric_only=True)

Fare        51.758668
Pclass       0.837836
Age         14.413493
Parch        0.865560
SibSp        1.041658
Survived     0.486592
dtype: float64

df.min(numeric_only=True)

Fare        0.00
Pclass      1.00
Age         0.17
Parch       0.00
SibSp       0.00
Survived    0.00
dtype: float64

df.max(numeric_only=True)

Fare        512.3292
Pclass        3.0000
Age          80.0000
Parch         9.0000
SibSp         8.0000
Survived      1.0000
dtype: float64
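The four statistics can also be gathered in a single call. The sketch below assumes a small made-up DataFrame (the toy name and its values are hypothetical, not taken from the Titanic data) and passes a list of function names to agg after selecting the numeric columns:

```python
import pandas as pd

# Hypothetical toy data; 'Name' is a text column that must be excluded
toy = pd.DataFrame({
    'Fare': [7.25, 71.28, 8.05, 53.10],
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Name': ['a', 'b', 'c', 'd'],
})

# Keep only the numeric columns, then compute all four statistics at once
numeric = toy.select_dtypes(include='number')
summary = numeric.agg(['mean', 'std', 'min', 'max'])
print(summary)
```

The result is a DataFrame with one row per statistic and one column per numeric column, which is easier to compare than four separate Series.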
What about the 33%, 66%, and 99% quantiles? Use the quantile method as follows:
# numeric_only=True skips the text columns (needed in pandas 2.0 and later)
df.quantile(0.33, numeric_only=True)

Fare         8.559325
Pclass       2.000000
Age         23.000000
Parch        0.000000
SibSp        0.000000
Survived     0.000000
Name: 0.33, dtype: float64

df.quantile(0.66, numeric_only=True)

Fare        26.0
Pclass       3.0
Age         34.0
Parch        0.0
SibSp        0.0
Survived     1.0
Name: 0.66, dtype: float64

df.quantile(0.99, numeric_only=True)

Fare        262.375
Pclass        3.000
Age          65.000
Parch         4.000
SibSp         5.000
Survived      1.000
Name: 0.99, dtype: float64
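The quantile method also accepts a list, so several quantiles can be requested at once. The following is a small sketch on made-up values (the toy DataFrame is an assumption for illustration), relying on pandas' default linear interpolation between neighbouring values:

```python
import pandas as pd

# Hypothetical toy data with evenly spaced values
toy = pd.DataFrame({
    'Age': [10.0, 20.0, 30.0, 40.0, 50.0],
    'Fare': [5.0, 10.0, 15.0, 20.0, 25.0],
})

# One row per requested quantile, one column per numeric column
q = toy.quantile([0.33, 0.66, 0.99])
print(q)
```

With five values, the 0.33 quantile of Age falls at position 0.33 * 4 = 1.32 in the sorted data, which is 32% of the way from 20 to 30.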
How many passengers were from each class? Let's see, using the groupby method:
class_groups = df.groupby('Pclass')
# Each iteration yields the class label and the sub-DataFrame for that class
for name, group in class_groups:
    print(f'Class: {name}: {len(group)}')

Class: 1: 323
Class: 2: 277
Class: 3: 709
The same question can be answered with selection/indexing methods that count the members of each class:
for clsGrp in df.Pclass.unique():
    num_class = len(df[df.Pclass == clsGrp])
    print(f'Class {clsGrp}: {num_class}')

Class 3: 709
Class 1: 323
Class 2: 277
The counts from the groupby approach and the selection/indexing approach match.
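For comparison, a third way to count class membership is value_counts, with groupby(...).size() as a loop-free variant of the groupby approach. This is only a sketch on a hypothetical six-row Pclass column, not the solution's own method:

```python
import pandas as pd

# Hypothetical toy data: two first-class, one second-class, three third-class
toy = pd.DataFrame({'Pclass': [3, 1, 3, 2, 3, 1]})

# value_counts orders by frequency, so sort by class label for readability
counts = toy['Pclass'].value_counts().sort_index()

# groupby(...).size() returns the number of rows in each group
by_group = toy.groupby('Pclass').size()

print(counts)
print(by_group)
```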
Determine who the eldest passenger in third class was:
third_class = df.loc[(df.Pclass == 3)]
third_class.loc[(third_class.Age == third_class.Age.max())]
The output will be as follows:
Figure 1.68: Eldest passenger in third class
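An alternative sketch uses idxmax, which returns the index label of the largest value and skips NaN entries by default. The toy DataFrame below is made up for illustration:

```python
import pandas as pd

# Hypothetical toy data; one third-class passenger has no recorded age
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Pclass': [3, 1, 3, 3],
    'Age': [30.0, 70.0, 45.0, float('nan')],
})

third_class = toy[toy['Pclass'] == 3]

# idxmax gives the index label of the oldest third-class passenger
eldest = toy.loc[third_class['Age'].idxmax()]
print(eldest)
```

Unlike the boolean comparison against Age.max(), idxmax returns only the first matching row, so it would miss ties between passengers of the same age.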
Many machine learning problems call for numerical values scaled between 0 and 1. Use the agg method with lambda functions to scale the Fare and Age columns between 0 and 1 by dividing each by its maximum:
fare_max = df.Fare.max()
age_max = df.Age.max()
df.agg({
    'Fare': lambda x: x / fare_max,
    'Age': lambda x: x / age_max,
}).head()
The output will be as follows:
Figure 1.69: Scaling numerical values between 0 and 1
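Dividing by the maximum maps the largest value to 1 but maps the smallest to 0 only when the column's minimum is itself 0. A common alternative, min-max scaling, subtracts the minimum first. The sketch below compares the two on made-up values (the toy DataFrame is hypothetical, and min-max scaling is offered as a variation rather than as the activity's intended answer):

```python
import pandas as pd

# Hypothetical toy data whose minimum values are not zero
toy = pd.DataFrame({
    'Fare': [10.0, 20.0, 50.0],
    'Age': [20.0, 40.0, 80.0],
})

# Divide-by-maximum scaling: the smallest value stays above 0
by_max = toy / toy.max()

# Min-max scaling: the smallest value maps to 0, the largest to 1
min_max = (toy - toy.min()) / (toy.max() - toy.min())

print(by_max)
print(min_max)
```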
There is one individual in the dataset without a listed Fare value:
df_nan_fare = df.loc[(df.Fare.isna())]
df_nan_fare
This is the output:
Figure 1.70: Individual without a listed Fare value
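To check how many values are missing in every column at once, isna can be combined with sum. The following is a minimal sketch on a hypothetical three-row DataFrame:

```python
import pandas as pd

# Hypothetical toy data with missing Fare and Cabin entries
toy = pd.DataFrame({
    'Fare': [7.25, float('nan'), 8.05],
    'Cabin': [None, 'C85', None],
    'Pclass': [3, 1, 3],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = toy.isna().sum()

# Boolean indexing selects the rows where Fare is missing
rows_missing_fare = toy[toy['Fare'].isna()]

print(missing_per_column)
print(rows_missing_fare)
```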
Use the groupby method to replace the NaN Fare value of this row in the main DataFrame with the mean Fare of the passengers who share the same class and Embarked location:
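embarked_class_groups = df.groupby(['Embarked', 'Pclass'])
# Row indices of the group that matches the passenger's Embarked and Pclass values
indices = embarked_class_groups.groups[(df_nan_fare.Embarked.values[0], df_nan_fare.Pclass.values[0])]
# mean() ignores the NaN entry, so it does not affect the result
mean_fare = df.iloc[indices].Fare.mean()
df.loc[(df.index == 1043), 'Fare'] = mean_fare
df.iloc[1043]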
The output will be as follows:
Cabin                      NaN
Embarked                     S
Fare                   14.4354
Pclass                       3
Ticket                    3701
Age                       60.5
Name        Storey, Mr. Thomas
Parch                        0
Sex                       male
SibSp                        0
Survived                   NaN
Name: 1043, dtype: object
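The same idea generalizes to any number of missing values without hard-coding a row index. The sketch below, on a made-up DataFrame, uses groupby with transform('mean') to build a Series of group means aligned with the original rows, then passes it to fillna. It is offered as a variation under these assumptions, not as the activity's required approach:

```python
import pandas as pd

# Hypothetical toy data: row 2 is missing its Fare
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'C', 'C'],
    'Pclass': [3, 3, 3, 1, 1],
    'Fare': [8.0, 12.0, float('nan'), 80.0, 100.0],
})

# Mean Fare of each (Embarked, Pclass) group, repeated for every row in the group
group_mean = toy.groupby(['Embarked', 'Pclass'])['Fare'].transform('mean')

# Only the missing entries are replaced; existing fares are left untouched
filled = toy.assign(Fare=toy['Fare'].fillna(group_mean))
print(filled)
```

Because transform returns a result with the same index as the input, fillna can align the group means with the missing rows directly.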