Chapter 1: Python Machine Learning Toolkit
Activity 1: pandas Functions
Solution
Open a new Jupyter notebook.
Use pandas to load the Titanic dataset:
import pandas as pd
df = pd.read_csv('titanic.csv')
Use the head() function on the dataset as follows:
# Have a look at the first 5 samples of the data
df.head()
The output will be as follows:
Use the describe function as follows:
df.describe(include='all')
The output will be as follows:
We don't need the Unnamed: 0 column. We can remove the column without using the del command, as follows:
df = df[df.columns[1:]]  # Keep every column except the first
df.head()
The output will be as follows:
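The positional slice above relies on Unnamed: 0 being the first column. As a minimal sketch of a name-based alternative, the following uses a small made-up frame (toy, with invented values, standing in for the real data) and shows that drop gives the same result as the slice:

```python
import pandas as pd

# Hypothetical miniature frame; the values are invented for illustration.
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'Pclass': [1, 3, 2],
    'Fare': [71.28, 7.25, 13.00],
})

# Remove the column by name, which does not depend on column order.
by_name = toy.drop(columns=['Unnamed: 0'])

# Positional slice, as in the solution: keep every column after the first.
by_position = toy[toy.columns[1:]]

print(by_name.columns.tolist())
print(by_name.equals(by_position))
```

Dropping by name is usually the safer choice when the column order of the input file may change.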
Compute the mean, standard deviation, minimum, and maximum values for the numeric columns of the DataFrame without using describe. In pandas 2.0 and later, pass numeric_only=True so that text columns such as Name are skipped:
df.mean(numeric_only=True)
Fare        33.295479
Pclass       2.294882
Age         29.881138
Parch        0.385027
SibSp        0.498854
Survived     0.383838
dtype: float64
df.std(numeric_only=True)
Fare        51.758668
Pclass       0.837836
Age         14.413493
Parch        0.865560
SibSp        1.041658
Survived     0.486592
dtype: float64
df.min(numeric_only=True)
Fare        0.00
Pclass      1.00
Age         0.17
Parch       0.00
SibSp       0.00
Survived    0.00
dtype: float64
df.max(numeric_only=True)
Fare        512.3292
Pclass        3.0000
Age          80.0000
Parch         9.0000
SibSp         8.0000
Survived      1.0000
dtype: float64
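The four statistics above can also be gathered in one table. The sketch below assumes a small made-up frame (toy, with invented values) and selects the numeric columns explicitly before calling agg with a list of function names:

```python
import pandas as pd

# Hypothetical miniature frame; the values are invented for illustration.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Fare': [10.0, 20.0, 30.0],
    'Age': [20.0, 40.0, 60.0],
})

# Keep only the numeric columns, then compute each statistic per column.
stats = toy.select_dtypes(include='number').agg(['mean', 'std', 'min', 'max'])

print(stats)
```

The result has one row per statistic and one column per numeric feature, which makes it easy to compare against the output of describe.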
What about the 33rd, 66th, and 99th percentiles? Use the quantile method as follows, again passing numeric_only=True so that text columns are skipped:
df.quantile(0.33, numeric_only=True)
Fare         8.559325
Pclass       2.000000
Age         23.000000
Parch        0.000000
SibSp        0.000000
Survived     0.000000
Name: 0.33, dtype: float64
df.quantile(0.66, numeric_only=True)
Fare        26.0
Pclass       3.0
Age         34.0
Parch        0.0
SibSp        0.0
Survived     1.0
Name: 0.66, dtype: float64
df.quantile(0.99, numeric_only=True)
Fare        262.375
Pclass        3.000
Age          65.000
Parch         4.000
SibSp         5.000
Survived      1.000
Name: 0.99, dtype: float64
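The quantile method also accepts a list of values, so the three calls can be combined into one. A minimal sketch on a made-up frame (toy, with invented values) follows:

```python
import pandas as pd

# Hypothetical miniature frame; the values are invented for illustration.
toy = pd.DataFrame({
    'Age': [10.0, 20.0, 30.0, 40.0, 50.0],
    'Name': ['a', 'b', 'c', 'd', 'e'],
})

# Passing a list returns a DataFrame with one row per requested quantile.
# numeric_only=True skips the text column.
quantiles = toy.quantile([0.33, 0.66, 0.99], numeric_only=True)

print(quantiles)
```

Each row of the result is labeled with its quantile, so individual values can be read with, for example, quantiles.loc[0.66, 'Age'].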
How many passengers were from each class? Let's see, using the groupby method:
class_groups = df.groupby('Pclass')
# Iterating over a GroupBy yields (group key, sub-DataFrame) pairs
for name, group in class_groups:
    print(f'Class: {name}: {len(group)}')
Class: 1: 323
Class: 2: 277
Class: 3: 709
Answer the same question again, this time using selecting/indexing methods to count the members of each class:
for clsGrp in df.Pclass.unique():
    num_class = len(df[df.Pclass == clsGrp])
    print(f'Class {clsGrp}: {num_class}')
Class 3: 709
Class 1: 323
Class 2: 277
The counts from the groupby method and from the selecting/indexing method match.
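A third way to obtain the same per-class counts is value_counts, which does the grouping and counting in one call. The sketch below assumes a small made-up Pclass column (toy, with invented values) and checks the result against groupby:

```python
import pandas as pd

# Hypothetical miniature frame; the values are invented for illustration.
toy = pd.DataFrame({'Pclass': [3, 1, 3, 2, 3, 1]})

# Count the rows per class and order the result by class label.
counts = toy['Pclass'].value_counts().sort_index()

for cls, n in counts.items():
    print(f'Class {cls}: {n}')

# The same counts via groupby, for comparison.
group_sizes = toy.groupby('Pclass').size()
```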
Determine who the eldest passenger in third class was:
third_class = df.loc[(df.Pclass == 3)]
third_class.loc[(third_class.Age == third_class.Age.max())]
The output will be as follows:
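When only a single row is needed, idxmax offers a shorter route: it returns the index label of the maximum value, skipping missing ages. The sketch below uses a made-up frame (toy, with invented names and values):

```python
import pandas as pd

# Hypothetical miniature frame; names and values are invented.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Pclass': [3, 1, 3, 3],
    'Age': [22.0, 71.0, 60.5, None],
})

third_class = toy[toy['Pclass'] == 3]

# idxmax returns the label of the largest Age (NaN is ignored);
# loc then fetches that row as a Series.
eldest = third_class.loc[third_class['Age'].idxmax()]

print(eldest['Name'], eldest['Age'])
```

Note that idxmax returns only the first match, whereas the boolean comparison against Age.max() keeps every passenger who shares the maximum age.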
For a number of machine learning problems, it is very common to scale the numerical values between 0 and 1. Use the agg method with lambda functions to scale the Fare and Age columns between 0 and 1:
fare_max = df.Fare.max()
age_max = df.Age.max()
df.agg({
    'Fare': lambda x: x / fare_max,
    'Age': lambda x: x / age_max,
}).head()
The output will be as follows:
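Dividing by the maximum maps a column onto the full range from 0 to 1 only when its minimum is 0, which holds for Fare but not for Age (whose minimum is 0.17). As a sketch of the more general min-max scaling, which is a different technique from the division used above, the following works on a made-up frame (toy, with invented values):

```python
import pandas as pd

# Hypothetical miniature frame; the values are invented for illustration.
toy = pd.DataFrame({
    'Fare': [0.0, 10.0, 40.0],
    'Age': [2.0, 20.0, 80.0],
})

cols = ['Fare', 'Age']
col_min = toy[cols].min()
col_max = toy[cols].max()

# Min-max scaling: subtract each column's minimum, then divide by its range.
scaled = (toy[cols] - col_min) / (col_max - col_min)

print(scaled)
```

After this transformation, the smallest value in each column is exactly 0 and the largest is exactly 1.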
There is one individual in the dataset without a listed Fare value:
df_nan_fare = df.loc[(df.Fare.isna())]
df_nan_fare
This is the output:
Replace the NaN value in this row of the main DataFrame with the mean Fare of the passengers who share the same class and Embarked location, using the groupby method:
embarked_class_groups = df.groupby(['Embarked', 'Pclass'])
# Index labels of all rows in the same (Embarked, Pclass) group
indices = embarked_class_groups.groups[
    (df_nan_fare.Embarked.values[0], df_nan_fare.Pclass.values[0])
]
# groups stores index labels, so look the rows up with loc
mean_fare = df.loc[indices].Fare.mean()
df.loc[(df.index == 1043), 'Fare'] = mean_fare
df.iloc[1043]
The output will be as follows:
Cabin                      NaN
Embarked                     S
Fare                   14.4354
Pclass                       3
Ticket                    3701
Age                       60.5
Name        Storey, Mr. Thomas
Parch                        0
Sex                       male
SibSp                        0
Survived                   NaN
Name: 1043, dtype: object
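The same imputation can be written without looking up a specific row. The following sketch is an alternative to the approach above, not the method used in this solution; it works on a made-up frame (toy, with invented values) and fills every missing Fare with the mean of its own (Embarked, Pclass) group:

```python
import pandas as pd

# Hypothetical miniature frame; the values are invented for illustration.
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'C', 'C'],
    'Pclass': [3, 3, 3, 1, 1],
    'Fare': [8.0, 12.0, None, 80.0, 100.0],
})

# transform('mean') returns a Series aligned with toy, where each row
# holds the mean Fare of its own group (NaN values are ignored).
group_mean = toy.groupby(['Embarked', 'Pclass'])['Fare'].transform('mean')

# Fill only the missing entries; existing fares are left untouched.
toy['Fare'] = toy['Fare'].fillna(group_mean)

print(toy)
```

Because the group means are aligned row by row, this version handles any number of missing fares and does not depend on a hard-coded index such as 1043.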