You're reading from Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits: A practical guide to implementing supervised and unsupervised machine learning algorithms in Python

Product type: Paperback
Published: Jul 2020
Publisher: Packt
ISBN-13: 9781838826048
Length: 384 pages
Edition: 1st Edition
Languages: Python
Tools: Scikit-learn
Concepts: Machine Learning
Author: Tarek Amr

Installing the packages you need

It's time to install the packages we will need in this book, but first of all, make sure you have Python installed on your computer. In this book, we will be using Python version 3.6. If your computer comes with Python 2.x installed, then you should upgrade Python to version 3.6 or later. I will show you how to install the required packages using pip, Python's de facto package-management system. If you use other package-management systems, such as Anaconda, you can easily find the equivalent installation commands for each of the following packages online.

To install scikit-learn, run the following command:

          $ pip install --upgrade scikit-learn==0.22

I will be using version 0.22 of scikit-learn here. You can add the --user switch to the pip command to limit the installation to your own user directory. This is important if you do not have root access to your machine or if you do not want to install the libraries globally. Furthermore, I prefer to create a virtual environment for each project I work on and install all the libraries that project needs into that environment. You can check the documentation for Anaconda or Python's venv module to see how to create virtual environments.
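As a minimal sketch of the virtual-environment approach, Python's standard-library venv module can also be driven from Python itself. The helper name create_env and the directory name ml-book-env are hypothetical choices made for this illustration; running python -m venv followed by a directory name in a terminal is the more common route.

```python
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment in env_dir and return its path."""
    env_path = Path(env_dir)
    # with_pip=True bootstraps pip into the new environment, so packages
    # can be installed there rather than globally.
    venv.create(env_path, with_pip=with_pip)
    return env_path


# Hypothetical usage (the directory name is an assumption):
# create_env('ml-book-env')
```

Once the environment exists, you activate it from your shell (for example, source ml-book-env/bin/activate on Linux and macOS) before running the pip commands in this section.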

Along with scikit-learn, we will need to install pandas. I will briefly introduce pandas in the next section, but for now, you can use the following command to install it:

          $ pip install --upgrade pandas==0.25.3

Optionally, you may want to install Jupyter. Jupyter notebooks allow you to write code in your browser and run bits of it in whichever order you want. This makes them ideal for experimenting and trying different parameters without having to rerun all of the code every time. You can also plot graphs in your notebooks with the help of Matplotlib. Use the following commands to install both Jupyter and Matplotlib:

          $ pip install jupyter
          $ pip install matplotlib

To start your Jupyter server, you can run jupyter notebook in your terminal, and then visit http://localhost:8888/ in your browser.

We will make use of other libraries later on in the book. I'd rather introduce you to them when we need them and show you how to install each of them then.

Introduction to pandas

pandas is an open source library that provides data analysis tools for the Python programming language. If this definition doesn't tell you much, then you may think of pandas as Python's response to spreadsheets. I have decided to dedicate this section to pandas since you will be using it to create and load the data you are going to use in this book. You will also use pandas to analyze and visualize your data and alter the value of its columns before applying machine learning algorithms to it.

Tables in pandas are referred to as DataFrames. If you are an R programmer, then this name should be familiar to you. Now, let's start by creating a DataFrame for some polygon names and the number of sides each has:

# It's customary to call pandas pd when importing it
import pandas as pd

polygons_data_frame = pd.DataFrame(
    {
        'Name': ['Triangle', 'Quadrilateral', 'Pentagon', 'Hexagon'],
        'Sides': [3, 4, 5, 6],
    }
)

You can then use the head method to print the first N rows of your newly created DataFrame:

polygons_data_frame.head(3)

Here, you can see the first three rows of the DataFrame. In addition to the columns we specified, pandas adds a default index, which numbers the rows 0, 1, and 2.
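To make the default index concrete, here is a small sketch that rebuilds the same four-row DataFrame and inspects what head returns. The variable name first_three is introduced only for this illustration.

```python
import pandas as pd

polygons_data_frame = pd.DataFrame(
    {
        'Name': ['Triangle', 'Quadrilateral', 'Pentagon', 'Hexagon'],
        'Sides': [3, 4, 5, 6],
    }
)

first_three = polygons_data_frame.head(3)

# The default index is a zero-based integer range
print(list(first_three.index))  # [0, 1, 2]

# shape is (number of rows, number of columns)
print(polygons_data_frame.shape)  # (4, 2)

# Calling head() without an argument returns up to five rows;
# this DataFrame only has four, so all four come back
print(len(polygons_data_frame.head()))  # 4
```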

Since we are programming in Python, we can also use the language's built-in functions, or even our own custom functions, when creating a DataFrame. Here, we will use the built-in range function rather than typing in all the possible side counts ourselves:

polygons = {
    'Name': [
        'Triangle', 'Quadrilateral', 'Pentagon', 'Hexagon',
        'Heptagon', 'Octagon', 'Nonagon', 'Decagon',
        'Hendecagon', 'Dodecagon', 'Tridecagon', 'Tetradecagon',
    ],
    # range takes the start, the end (exclusive), and the step
    'Sides': range(3, 15, 1),
}
polygons_data_frame = pd.DataFrame(polygons)
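One detail worth checking is that the end of a range is exclusive, which is why 15 is needed to reach the 14 sides of a tetradecagon. The short sketch below confirms that the range yields exactly twelve values, one per polygon name. The variable name sides is used only for this illustration.

```python
sides = list(range(3, 15, 1))

# The range starts at 3 and stops before 15
print(sides[0], sides[-1])  # 3 14

# Twelve values, matching the twelve polygon names
print(len(sides))  # 12

# The step defaults to 1, so range(3, 15) is the same sequence
print(sides == list(range(3, 15)))  # True
```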

You can also sort your DataFrame by column. Here, we will sort it by polygon name in alphabetical order, and then print the first five polygons:

polygons_data_frame.sort_values('Name').head(5)

This time, we can see the first five rows of the DataFrame after it has been ordered alphabetically by polygon name: Decagon, Dodecagon, Hendecagon, Heptagon, and Hexagon.
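Sorting is not limited to ascending order. The following sketch, which uses a shortened four-row DataFrame to keep the output readable, sorts by the number of sides in descending order through the ascending parameter of sort_values. The variable name by_sides_desc is an assumption made for this example.

```python
import pandas as pd

polygons_data_frame = pd.DataFrame(
    {
        'Name': ['Triangle', 'Quadrilateral', 'Pentagon', 'Hexagon'],
        'Sides': [3, 4, 5, 6],
    }
)

# sort_values returns a new DataFrame; the original is left untouched
by_sides_desc = polygons_data_frame.sort_values('Sides', ascending=False)

print(by_sides_desc['Name'].tolist())
# ['Hexagon', 'Pentagon', 'Quadrilateral', 'Triangle']

# The original index labels travel with their rows
print(by_sides_desc.index.tolist())  # [3, 2, 1, 0]
```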

Feature engineering is the art of deriving new features by manipulating existing data. This is something that pandas is good at. In the following example, we create a new column, Length of Name, and fill it with the character length of each polygon's name:

polygons_data_frame[
   'Length of Name'
] = polygons_data_frame['Name'].str.len()

We use the str accessor to reach pandas's string methods and apply them to the values in the Name column; here, we call its len method. Another way to achieve the same result is to use the apply() method. If you call apply() on a column, it passes each value in the column to a function of your choice, which can be any Python built-in or custom function. Here are two examples of how to use the apply() method.

Example 1 is as follows:

polygons_data_frame[
   'Length of Name'
] = polygons_data_frame['Name'].apply(len)

Example 2 is as follows:

polygons_data_frame[
   'Length of Name'
] = polygons_data_frame['Name'].apply(lambda n: len(n))

The good thing about the apply() method is that it allows you to run your own custom code on every value, which is something you will need a lot when performing complex feature engineering. Nevertheless, the code you run using the apply() method isn't as optimized as the str.len() call we started with. This is a clear case of a flexibility versus performance trade-off that you should be aware of.
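The sketch below illustrates both sides of this trade-off. The count_vowels function is a hypothetical custom feature invented for this example, and the timing comparison is only a rough measurement; the actual numbers depend on your machine and pandas version, so none are quoted here.

```python
import timeit

import pandas as pd

names = pd.Series(['Triangle', 'Quadrilateral', 'Pentagon', 'Hexagon'])


def count_vowels(name):
    """Hypothetical custom feature: the number of vowels in a name."""
    return sum(1 for char in name.lower() if char in 'aeiou')


# apply() calls the custom function once per value
vowel_counts = names.apply(count_vowels)
print(vowel_counts.tolist())  # [3, 6, 3, 3]

# Both ways of computing the length agree on the result
print(names.str.len().tolist() == names.apply(len).tolist())  # True

# Repeat the names to get a series large enough to time
many_names = pd.concat([names] * 10_000, ignore_index=True)
str_time = timeit.timeit(lambda: many_names.str.len(), number=10)
apply_time = timeit.timeit(lambda: many_names.apply(len), number=10)
print(f'str.len: {str_time:.4f}s, apply(len): {apply_time:.4f}s')
```

Measuring on your own data, as done here with timeit, is a safer guide than assuming one approach is always faster.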

Finally, we can use the plotting capabilities provided by pandas and Matplotlib to see whether there is any correlation between the number of sides a polygon has and the length of its name:

# We use the DataFrame's plot method here, 
# where we specify that this is a scatter plot
# and also specify which columns to use for x and y
polygons_data_frame.plot(
    title='Sides vs Length of Name',
    kind='scatter',
    x='Sides',
    y='Length of Name',
)

Once we run the previous code, a scatter plot will be displayed.

Scatter plots are generally useful for seeing correlations between two features. In this plot, there is no clear correlation to be seen.
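A visual impression can be complemented with a number. As a minimal sketch, the corr method of a pandas Series computes the Pearson correlation coefficient between two columns; values close to 0 suggest little linear relationship, while values close to 1 or -1 suggest a strong one. The variable name correlation is an assumption made for this example.

```python
import pandas as pd

polygons_data_frame = pd.DataFrame(
    {
        'Name': [
            'Triangle', 'Quadrilateral', 'Pentagon', 'Hexagon',
            'Heptagon', 'Octagon', 'Nonagon', 'Decagon',
            'Hendecagon', 'Dodecagon', 'Tridecagon', 'Tetradecagon',
        ],
        'Sides': range(3, 15),
    }
)
polygons_data_frame['Length of Name'] = polygons_data_frame['Name'].str.len()

# Pearson correlation between the two plotted columns
correlation = polygons_data_frame['Sides'].corr(
    polygons_data_frame['Length of Name']
)
print(round(correlation, 2))
```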

Python's scientific computing ecosystem conventions

Throughout this book, I will be using pandas, NumPy, SciPy, Matplotlib, and Seaborn. Any time you see the np, sp, pd, sns, and plt prefixes, you should assume that I have run the following import statements prior to the code:

import numpy as np
import scipy as sp
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

This is the de facto way of importing the scientific computing ecosystem into Python. If any of these libraries is missing on your computer, here is how to install them using pip:

          $ pip install --upgrade numpy==1.17.3
          $ pip install --upgrade scipy==1.3.1
          $ pip install --upgrade pandas==0.25.3
          $ pip install --upgrade scikit-learn==0.22
          $ pip install --upgrade matplotlib==3.1.2
          $ pip install --upgrade seaborn==0.9.0

Usually, you do not need to specify the version of each library; running pip install numpy will just install the latest stable version of the library. Nevertheless, pinning the version is good practice for reproducibility. It helps ensure that the same code produces the same results when it runs on different machines.
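If you want to confirm that your environment matches these pins, one possible approach is sketched below. It assumes Python 3.8 or later, where importlib.metadata is part of the standard library; the names PINNED and check_pins are hypothetical helpers written for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the pip commands above,
# keyed by the name used with pip install
PINNED = {
    'numpy': '1.17.3',
    'scipy': '1.3.1',
    'pandas': '0.25.3',
    'scikit-learn': '0.22',
    'matplotlib': '3.1.2',
    'seaborn': '0.9.0',
}


def check_pins(pins):
    """Return {package: (installed, pinned)} for every mismatch.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for package, pinned in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[package] = (installed, pinned)
    return mismatches


for package, (installed, pinned) in check_pins(PINNED).items():
    print(f'{package}: installed={installed}, pinned={pinned}')
```

An empty result means every listed package is installed at exactly the pinned version.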

The code used in this book is written in Jupyter notebooks. I advise you to do the same on your machine. In general, the code should run smoothly in any other environment with very few changes when it comes to printing and displaying the results. If the figures are not shown in your Jupyter notebook, you may need to run the following line at least once in any cell at the beginning of your notebook:

          %matplotlib inline

Furthermore, randomness is quite common in many machine learning tasks. We may need to create random data to use with our algorithms. We may also randomly split this data into training and test sets. The algorithms themselves may use random values for initialization. There are tricks to make sure we all get the exact same results, such as seeding the pseudo-random number generator. I will use these tricks when needed, but at other times it is better to let the results vary slightly, to give you an idea of how things are not always deterministic and how to deal with the underlying uncertainties. More on this later.
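As a minimal sketch of one such trick, the example below seeds NumPy's random number generator so that a shuffle can be repeated exactly. It assumes NumPy 1.17 or later, where np.random.default_rng is available; the helper shuffled_indices and the seed value 42 are arbitrary choices made for this illustration and are not necessarily what the rest of the book uses.

```python
import numpy as np


def shuffled_indices(n_samples, seed=None):
    """Return the integers 0..n_samples-1 in a random order."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n_samples)


# The same seed reproduces the same "random" order on every run
first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)
print((first == second).all())  # True

# Without a seed, the order generally differs from run to run
print(shuffled_indices(10))

# A reproducible 80/20 split of row indices into training and test sets
train_idx, test_idx = first[:8], first[8:]
print(len(train_idx), len(test_idx))  # 8 2
```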