Applied Supervised Learning with Python

Use scikit-learn to build predictive models from real-world datasets and prepare yourself for the future of machine learning

Product type: Paperback
Published: Apr 2019
ISBN-13: 9781789954920
Length: 404 pages
Edition: 1st Edition
Authors: Ishita Mathur, Benjamin Johnston

Chapter 1: Python Machine Learning Toolkit


Activity 1: pandas Functions

Solution

  1. Open a new Jupyter notebook.

  2. Use pandas to load the Titanic dataset:

    import pandas as pd
    
    df = pd.read_csv('titanic.csv')

    Use the head() method to preview the dataset:

    # Have a look at the first 5 rows of the data
    df.head()

    The output will be as follows:

    Figure 1.65: First five rows

    Use the describe() method, passing include='all' so that non-numeric columns are summarized as well:

    df.describe(include='all')

    The output will be as follows:

    Figure 1.66: Output of describe()

  3. We don't need the Unnamed: 0 column. We can remove it without using the del command by selecting every column except the first:

    df = df[df.columns[1:]] # Keep all columns except the first (Unnamed: 0)
    df.head()

    The output will be as follows:

    Figure 1.67: First five rows after deleting the Unnamed: 0 column
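
Selecting by position works here because the unwanted column happens to come first. As a minimal sketch of a name-based alternative, the example below uses `DataFrame.drop` on a small made-up frame (the `toy` data is hypothetical and is not the Titanic file):

```python
import pandas as pd

# Hypothetical stand-in for the loaded file; values are illustrative only
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'Pclass': [1, 3, 2],
    'Fare': [71.28, 7.25, 13.00],
})

# Drop the column by name rather than by position;
# drop returns a new DataFrame and leaves toy unchanged
cleaned = toy.drop(columns=['Unnamed: 0'])
print(cleaned.columns.tolist())
```

If the extra column is simply a saved row index, passing `index_col=0` to `pd.read_csv` is another option that avoids creating it in the first place.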

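  4. Compute the mean, standard deviation, minimum, and maximum values for the numeric columns of the DataFrame without using describe. Passing numeric_only=True restricts each call to the numeric columns, which pandas 2.0 and later no longer select automatically:

    df.mean(numeric_only=True)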
    
    Fare        33.295479
    Pclass       2.294882
    Age         29.881138
    Parch        0.385027
    SibSp        0.498854
    Survived     0.383838
    dtype: float64
    
    
    df.std(numeric_only=True)
    
    Fare        51.758668
    Pclass       0.837836
    Age         14.413493
    Parch        0.865560
    SibSp        1.041658
    Survived     0.486592
    dtype: float64
    
    
    df.min(numeric_only=True)
    
    Fare        0.00
    Pclass      1.00
    Age         0.17
    Parch       0.00
    SibSp       0.00
    Survived    0.00
    dtype: float64
    
    
    df.max(numeric_only=True)
    
    Fare        512.3292
    Pclass        3.0000
    Age          80.0000
    Parch         9.0000
    SibSp         8.0000
    Survived      1.0000
    dtype: float64
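
The four statistics above can also be collected in a single table. The sketch below is one way to do that, assuming a small made-up frame (`toy` is hypothetical): it selects the numeric columns with `select_dtypes` and passes a list of statistic names to `agg`.

```python
import pandas as pd

# Hypothetical data; the Name column is non-numeric and is excluded below
toy = pd.DataFrame({
    'Fare': [10.0, 20.0, 30.0],
    'Age': [20.0, 30.0, 40.0],
    'Name': ['a', 'b', 'c'],
})

# Keep numeric columns only, then compute several statistics in one call;
# the result has one row per statistic and one column per feature
stats = toy.select_dtypes(include='number').agg(['mean', 'std', 'min', 'max'])
print(stats)
```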

  5. What about the 33%, 66%, and 99% quantiles? Use the quantile method as follows:

    df.quantile(0.33, numeric_only=True)
    
    Fare         8.559325
    Pclass       2.000000
    Age         23.000000
    Parch        0.000000
    SibSp        0.000000
    Survived     0.000000
    Name: 0.33, dtype: float64
    
    df.quantile(0.66, numeric_only=True)
    
    Fare        26.0
    Pclass       3.0
    Age         34.0
    Parch        0.0
    SibSp        0.0
    Survived     1.0
    Name: 0.66, dtype: float64
    
    
    df.quantile(0.99, numeric_only=True)
    
    Fare        262.375
    Pclass        3.000
    Age          65.000
    Parch         4.000
    SibSp         5.000
    Survived      1.000
    Name: 0.99, dtype: float64
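
The quantile method also accepts a list, which returns all requested quantiles in one DataFrame instead of three separate Series. A minimal sketch on made-up data (`toy` is hypothetical, and the quantile levels are chosen only to keep the arithmetic easy to follow):

```python
import pandas as pd

# Hypothetical, evenly spaced values
toy = pd.DataFrame({'Fare': [0.0, 10.0, 20.0, 30.0, 40.0]})

# A list of quantile levels yields one row per level,
# indexed by the level itself
q = toy.quantile([0.25, 0.5, 0.75])
print(q)
```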

  6. How many passengers were in each class? Let's see, using the groupby method:

    class_groups = df.groupby('Pclass')
    # Iterating over a groupby yields (key, sub-DataFrame) pairs
    for name, group in class_groups:
        print(f'Class: {name}: {len(group)}')
    
    Class: 1: 323
    Class: 2: 277
    Class: 3: 709

  7. Answer the same question again, this time using selecting/indexing methods to count the members of each class:

    for clsGrp in df.Pclass.unique():
        num_class = len(df[df.Pclass == clsGrp])
        print(f'Class {clsGrp}: {num_class}')
    
    Class 3: 709
    Class 1: 323
    Class 2: 277

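    The counts match those from Step 6. Only the order differs, because unique() returns the class values in order of first appearance, whereas groupby sorts its keys.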
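
For a plain count per category, `value_counts` is a shorter route to the same numbers. The sketch below runs on a small made-up column (`toy` is hypothetical), so the counts differ from the Titanic output above:

```python
import pandas as pd

# Hypothetical class labels
toy = pd.DataFrame({'Pclass': [3, 1, 3, 2, 3, 1]})

# Count rows per class; sort_index orders the result by class label
# instead of by descending count
counts = toy['Pclass'].value_counts().sort_index()
print(counts)
```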

  8. Determine who the eldest passenger in third class was:

    third_class = df.loc[(df.Pclass == 3)]
    
    third_class.loc[(third_class.Age == third_class.Age.max())]

    The output will be as follows:

    Figure 1.68: Eldest passenger in third class
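
An alternative to comparing against `Age.max()` is `idxmax`, which returns the index label of the largest value. This is only a sketch on made-up rows (`toy` and its names are hypothetical). Unlike the boolean mask above, it returns a single row even when several passengers share the maximum age.

```python
import pandas as pd

# Hypothetical passengers; one third-class age is missing
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Pclass': [3, 1, 3, 3],
    'Age': [22.0, 80.0, 74.0, None],
})

third = toy[toy.Pclass == 3]
# idxmax skips NaN by default and returns the label of the first maximum
oldest = third.loc[third.Age.idxmax()]
print(oldest['Name'], oldest['Age'])
```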

  9. Many machine learning algorithms work better when numerical features are scaled to lie between 0 and 1. Use the agg method with lambda functions to divide the Fare and Age columns by their maximum values, which maps these non-negative columns into that range:

    fare_max = df.Fare.max()
    age_max = df.Age.max()
    
    df.agg({
        'Fare': lambda x: x / fare_max, 
        'Age': lambda x: x / age_max,
    }).head()

    The output will be as follows:

    Figure 1.69: Scaling numerical values between 0 and 1
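
Dividing by the maximum keeps values in the range 0 to 1 only when the column has no negative entries, and the smallest value does not necessarily map to 0. The sketch below contrasts it with min-max scaling, (x - min) / (max - min), using plain vectorized arithmetic on a made-up frame (`toy` is hypothetical). It is an illustration of the two formulas, not the method used in the step above.

```python
import pandas as pd

# Hypothetical values
toy = pd.DataFrame({
    'Fare': [0.0, 50.0, 100.0],
    'Age': [20.0, 40.0, 80.0],
})
cols = ['Fare', 'Age']

# Division by the column maximum: largest value maps to 1
max_scaled = toy[cols] / toy[cols].max()

# Min-max scaling: smallest value maps to 0, largest to 1
col_min = toy[cols].min()
col_max = toy[cols].max()
minmax_scaled = (toy[cols] - col_min) / (col_max - col_min)

print(max_scaled)
print(minmax_scaled)
```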

  10. There is one individual in the dataset without a listed Fare value:

    df_nan_fare = df.loc[(df.Fare.isna())]
    df_nan_fare

    This is the output:

    Figure 1.70: Individual without a listed Fare value

    Replace the missing Fare in the main DataFrame with the mean Fare of the passengers who share this passenger's class and Embarked location, using the groupby method:

    embarked_class_groups = df.groupby(['Embarked', 'Pclass'])
    
    # groups maps each (Embarked, Pclass) key to the index labels of its rows
    indices = embarked_class_groups.groups[(df_nan_fare.Embarked.values[0], df_nan_fare.Pclass.values[0])]
    mean_fare = df.loc[indices].Fare.mean()  # labels, so use loc; mean() skips NaN
    df.loc[df_nan_fare.index, 'Fare'] = mean_fare
    df.loc[1043]

    The output will be as follows:

    Cabin                      NaN
    Embarked                     S
    Fare                   14.4354
    Pclass                       3
    Ticket                    3701
    Age                       60.5
    Name        Storey, Mr. Thomas
    Parch                        0
    Sex                       male
    SibSp                        0
    Survived                   NaN
    Name: 1043, dtype: object
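
When more than one row is missing a value, the same idea generalizes without looking up each group by hand. The following sketch, on a small made-up frame (`toy` is hypothetical), uses `groupby(...).transform('mean')` to build a per-row group mean and `fillna` to apply it only where Fare is missing:

```python
import pandas as pd

# Hypothetical data with one missing fare in the (S, 3) group
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'C'],
    'Pclass': [3, 3, 3, 1],
    'Fare': [8.0, 12.0, None, 80.0],
})

# Mean Fare of each (Embarked, Pclass) group, aligned to the original rows;
# NaN values are ignored when the mean is computed
group_mean = toy.groupby(['Embarked', 'Pclass'])['Fare'].transform('mean')

# Fill only the missing entries; existing fares are left untouched
toy['Fare'] = toy['Fare'].fillna(group_mean)
print(toy)
```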