Packt+ | Advance your knowledge in tech

You're reading from Applied Supervised Learning with Python Use scikit-learn to build predictive models from real-world datasets and prepare yourself for the future of machine learning

Product type Paperback

Published in Apr 2019

Publisher

ISBN-13 9781789954920

Length 404 pages

Edition 1st Edition

Languages

Python

Tools

Scikit-learn

Concepts

Machine Learning

Authors (2):

Benjamin Johnston

Ishita Mathur

View More author details

Table of Contents (9) Chapters

Applied Supervised Learning with Python

Preface

1. Python Machine Learning Toolkit

2. Exploratory Data Analysis and Visualization FREE CHAPTER

3. Regression Analysis

4. Classification

5. Ensemble Modeling

6. Model Evaluation

Appendix

Supervised Machine Learning

A machine learning algorithm is commonly thought of as simply the mathematical process (or algorithm) itself, such as a neural network, deep neural network, or random forest algorithm. However, this is only a component of the overall system; firstly, we must define the problem that can be adequately solved using such techniques. Then, we must specify and procure a clean dataset that is composed of information that can be mapped from the first number space to a secondary one. Once the dataset has been designed and procured, the machine learning model can be specified and designed; for example, a single-layer neural network with 100 hidden nodes that uses a tanh activation function.

With the dataset and model well defined, the means of determining the exact values for the model can be specified. This is a repetitive optimization process that evaluates the output of the model against some existing data and is commonly referred to as training. Once training has been completed and you have your defined model, then it is good practice to evaluate it against some reference data to provide a benchmark of overall performance.

Considering this general description of a complete machine learning algorithm, the problem definition and data collection stages are often the most critical. What is the problem you are trying to solve? What outcome would you like to achieve? How are you going to achieve it? How you answer these questions will drive and define many of the subsequent decisions or model design choices. It is also in answering these questions that we will select which category of machine learning algorithms we will choose: supervised or unsupervised methods.

So, what exactly are supervised and unsupervised machine learning problems or methods? Supervised learning techniques center on mapping some set of information to another by providing the training process with the input information and the desired outputs, then checking its ability to provide the correct result. As an example, let's say you are the publisher of a magazine that reviews and ranks hairstyles from various time periods. Your readers frequently send you far more images of their favorite hairstyles for review than you can manually process. To save some time, you would like to automate the sorting of the hairstyles images you receive based on time periods, starting with hairstyles from the 1960s and 1980s:

Figure 1.1: Hairstyles images from different time periods

To create your hairstyles-sorting algorithm, you start by collecting a large sample of hairstyles images and manually labeling each one with its corresponding time period. Such a dataset (known as a labeled dataset) is the input data (hairstyles images) and the desired output information (time period) is known and recorded. This type of problem is a classic supervised learning problem; we are trying to develop an algorithm that takes a set of inputs and learns to return the answers that we have told it are correct.

When to Use Supervised Learning

Generally, if you are trying to automate or replicate an existing process, the problem is a supervised learning problem. Supervised learning techniques are both very useful and powerful, and you may have come across them or even helped create labeled datasets for them without realizing. As an example, a few years ago, Facebook introduced the ability to tag your friends in any image uploaded to the platform. To tag a friend, you would draw a square over your friend's face and then add the name of your friend to notify them of the image. Fast-forward to today and Facebook will automatically identify your friends in the image and tag them for you. This is yet another example of supervised learning. If you ever used the early tagging system and manually identified your friends in an image, you were in fact helping to create Facebook's labeled dataset. A user who uploaded an image of a person's face (the input data) and tagged the photo with the subject's name would then create the label for the dataset. As users continued to use this tagging service, a sufficiently large labeled dataset was created for the supervised learning problem. Now friend-tagging is completed automatically by Facebook, replacing the manual process with a supervised learning algorithm, as opposed to manual user input:

Figure 1.2: Tagging a friend on Facebook

One particularly timely and straightforward example of supervised learning is the training of self-driving cars. In this example, the algorithm uses the target route as determined by the GPS system, as well as on-board instrumentation, such as speed measures, the brake position, and/or a camera or Light Detection and Ranging (LIDAR), for road obstacle detection as the labeled outputs of the system. During training, the algorithm samples the control inputs as provided by the human driver, such as speed, steering angle, and brake position, mapping them against the outputs of the system; thus providing the labeled dataset. This data can then be used to train the driving/navigation systems within the self-driving car or in simulation exercises.

Image-based supervised problems, while popular, are not the only examples of supervised learning problems. Supervised learning is also commonly used in the automatic analysis of text to determine whether the opinion or tone of a message is positive, negative, or neutral. Such analysis is known as sentiment analysis and frequently involves creating and using a labeled dataset of a series of words or statements that are manually identified as either positive, neutral, or negative. Consider these sentences: I like that movie and I hate that movie. The first sentence is clearly positive, while the second is negative. We can then decompose the words in the sentences into either positive, negative, or neutral (both positive, both negative); see the following table: 

Figure 1.3: Decomposition of the words

Using sentiment analysis, a supervised learning algorithm could be created, say, using the movie database site IMDb to analyze comments posted about movies to determine whether the movie is being positively or negatively reviewed by the audience. Supervised learning methods could have other applications, such as analyzing customer complaints, automating troubleshooting calls/chat sessions, or even medical applications such as analyzing images of moles to detect abnormalities (https://www.nature.com/articles/nature21056).

This should give you a good understanding of the concept of supervised learning, as well as some examples of problems that can be solved using these techniques. While supervised learning involves training an algorithm to map the input information to corresponding known outputs, unsupervised learning methods, by contrast, do not utilize known outputs, either because they are not available or even known. Rather than relying on a set of manually annotated labels, unsupervised learning methods model the supplied data through specific constraints or rules designed into the training process.

Clustering analysis is a common form of unsupervised learning where a dataset is to be divided into a specified number of different groups based on the clustering process being used. In the case of k-nearest neighbors clustering, each sample from the dataset is labeled or classified in accordance with the majority vote of the k-closest points to the sample. As there are no manually identified labels, the performance of unsupervised algorithms can vary greatly with the data being used, as well as the selected parameters of the model. For example, should we use the 5 closest or 10 closest points in the majority vote of the k-closest points? The lack of known and target outputs during training leads to unsupervised methods being commonly used in exploratory analysis or in scenarios where the ground truth targets are somewhat ambiguous and are better defined by the constraints of the learning method.

We will not cover unsupervised learning in great detail in this book, but it is useful to summarize the main difference between the two methods. Supervised learning methods require ground truth labels or the answers for the input data, while unsupervised methods do not use such labels, and the final result is determined by the constraints applied during the training process.

Why Python?

So, why have we chosen the Python programming language for our investigation into supervised machine learning? There are a number of alternative languages available, including C++, R, and Julia. Even the Rust community is developing machine learning libraries for their up-and-coming language. There are a number of reasons why Python is the first-choice language for machine learning:

There is great demand for developers with Python expertise in both industry and academic research.
Python is currently one of the most popular programming languages, even reaching the number one spot in IEEE Spectrum magazine's survey of the top 10 programming languages (https://spectrum.ieee.org/at-work/innovation/the-2018-top-programming-languages).
Python is an open source project, with the entire source code for the Python programming language being freely available under the GNU GPL Version 2 license. This licensing mechanism has allowed Python to be used, modified, and even extended in a number of other projects, including the Linux operating system, supporting NASA (https://www.python.org/about/success/usa/), and a plethora of other libraries and projects that have provided additional functionality, choice, and flexibility to the Python programming language. In our opinion, this flexibility is one of the key components that has made Python so popular.
Python provides a common set of features that can be used to run a web server, a microservice on an embedded device, or to leverage the power of graphical processing units to perform precise calculations on large datasets.
Using Python and a handful of specific libraries (or packages, as they are known in Python), an entire machine learning product can be developed—starting with exploratory data analysis, model definition, and refinement, through to API construction and deployment. All of these steps can be completed within Python to build an end-to-end solution. This is the significant advantage Python has over some of its competitors, particularly within the data science and machine learning space. While R and Julia have the advantage of being specifically designed for numerical and statistical computing, models developed in these languages typically require translation into some other language before they can be deployed in a production setting.

We hope that, through this book, you will gain an understanding of the flexibility and power of the Python programming language and will start on the path of developing end-to-end supervised learning solutions in Python. So, let's get started.