scikit-learn Cookbook, Second Edition: Over 80 recipes for machine learning in Python with scikit-learn

Trent Hauck
Paperback Nov 2017 374 pages 2nd Edition

High-Performance Machine Learning – NumPy

In this chapter, we will cover the following recipes:

  • NumPy basics
  • Loading the iris dataset
  • Viewing the iris dataset
  • Viewing the iris dataset with pandas
  • Plotting with NumPy and matplotlib
  • A minimal machine learning recipe – SVM classification
  • Introducing cross-validation
  • Putting it all together
  • Machine learning overview – classification versus regression

Introduction

In this chapter, we'll learn how to make predictions with scikit-learn. Machine learning emphasizes measuring the ability to predict, and with scikit-learn we will predict accurately and quickly.

We will examine the iris dataset, which consists of measurements of three types of Iris flowers: Iris Setosa, Iris Versicolor, and Iris Virginica.

To measure the strength of the predictions, we will:

  • Save some data for testing
  • Build a model using only training data
  • Measure the predictive power on the test set

The prediction, one of three flower types, is categorical. This type of problem is called a classification problem.

Informally, classification asks, "Is it an apple or an orange?" Contrast this with machine learning regression, which asks, "How many apples?" By the way, the answer can be 4.5 apples for regression.

By the evolution of its design, scikit-learn addresses machine learning mainly via four categories:

  • Classification:
    • Non-text classification, like the Iris flowers example
    • Text classification
  • Regression
  • Clustering
  • Dimensionality reduction

NumPy basics

Data science deals in part with structured tables of data. The scikit-learn library requires input tables of two-dimensional NumPy arrays. In this section, you will learn about the numpy library.

How to do it...

We will try a few operations on NumPy arrays. NumPy arrays have a single type for all of their elements and a predefined shape. Let us look first at their shape.

The shape and dimension of NumPy arrays

  1. Start by importing NumPy:
import numpy as np
  1. Produce a NumPy array of 10 digits, similar to Python's range(10) function:
np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
  1. The array looks like a Python list with only one pair of brackets. This means it is of one dimension. Store the array and find out the shape:
array_1 = np.arange(10)
array_1.shape
(10,)
  1. The array has a data attribute, shape. Its value, array_1.shape, is the tuple (10,), which has length 1 in this case. The number of dimensions is the same as the length of the tuple, so the array has 1 dimension:
array_1.ndim      #Find number of dimensions of array_1
1
  1. The array has 10 elements. Reshape the array by calling the reshape method:
array_1.reshape((5,2))
array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
  1. This reshapes the array into a 5 x 2 data object that resembles a list of lists (a three-dimensional NumPy array looks like a list of lists of lists). The change was not saved, because reshape returns a new array. Save the reshaped array as follows:
array_1 = array_1.reshape((5,2))
  1. Note that array_1 is now two-dimensional. This is expected, as its shape has two numbers and it looks like a Python list of lists:
array_1.ndim
2
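As a quick check of the points above, the following sketch (the variable names original and reshaped are illustrative) confirms that reshape returns a new array object rather than changing the original, and that ndim equals the length of the shape tuple.

```python
import numpy as np

original = np.arange(10)               # one-dimensional, shape (10,)
reshaped = original.reshape((5, 2))    # a new array object; `original` keeps its shape

print(original.shape, original.ndim)   # (10,) 1
print(reshaped.shape, reshaped.ndim)   # (5, 2) 2

# ndim is always the length of the shape tuple
print(reshaped.ndim == len(reshaped.shape))   # True
```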

NumPy broadcasting

  1. Add 1 to every element of the array by broadcasting. Note that changes to the array are not saved:
array_1 + 1
array([[ 1, 2],
[ 3, 4],
[ 5, 6],
[ 7, 8],
[ 9, 10]])

The term broadcasting refers to the smaller array being stretched or broadcast across the larger array. In the first example, the scalar 1 was stretched to a 5 x 2 shape and then added to array_1.

  1. Create a new array_2 array. Observe what occurs when you multiply the array by itself (this is not matrix multiplication; it is element-wise multiplication of arrays):
array_2 = np.arange(10)
array_2 * array_2
array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81])
  1. Every element has been squared. Here, element-wise multiplication has occurred. Here is a more complicated example:
array_2 = array_2 ** 2  #Note that this is equivalent to array_2 * array_2
array_2 = array_2.reshape((5,2))
array_2
array([[ 0, 1],
[ 4, 9],
[16, 25],
[36, 49],
[64, 81]])
  1. Change array_1 as well:
array_1 = array_1 + 1
array_1
array([[ 1, 2],
[ 3, 4],
[ 5, 6],
[ 7, 8],
[ 9, 10]])
  1. Now add array_1 and array_2 element-wise by simply placing a plus sign between the arrays:
array_1 + array_2
array([[ 1, 3],
[ 7, 13],
[21, 31],
[43, 57],
[73, 91]])
  1. The formal broadcasting rules require that, when you compare the shapes of both arrays from right to left, each pair of numbers either matches or contains a one. The shapes 5 x 2 and 5 x 2 match for both entries from right to left. However, the shape 5 x 2 x 1 does not match 5 x 2, as the second values from the right, 2 and 5 respectively, are mismatched.
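The shape rule can be tried out directly. The sketch below uses small illustrative arrays (a, row, col, and b are names chosen here, not taken from the recipe) to show two compatible pairs of shapes and the incompatible 5 x 2 x 1 versus 5 x 2 case, which NumPy rejects with a ValueError.

```python
import numpy as np

a = np.ones((5, 2))

# (5, 2) with (2,): the trailing dimensions match, so the row is stretched over 5 rows
row = np.array([10, 20])
print((a + row).shape)        # (5, 2)

# (5, 2) with (5, 1): the trailing 1 is stretched across both columns
col = np.arange(5).reshape((5, 1))
print((a + col).shape)        # (5, 2)

# (5, 2, 1) with (5, 2): from the right, 1 vs 2 is fine (one of them is 1),
# but 2 vs 5 neither match nor contain a one
b = np.ones((5, 2, 1))
try:
    a + b
    compatible = True
except ValueError:
    compatible = False
print(compatible)             # False
```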

Initializing NumPy arrays and dtypes

There are several ways to initialize NumPy arrays besides np.arange:

  1. Initialize an array of zeros with np.zeros. The np.zeros((5,2)) command creates a 5 x 2 array of zeros:
np.zeros((5,2))
array([[ 0., 0.],
[ 0., 0.],
[ 0., 0.],
[ 0., 0.],
[ 0., 0.]])
  1. Initialize an array of ones using np.ones. Introduce a dtype argument, set to int, to ensure that the ones are of NumPy integer type (the older np.int alias has been removed from recent NumPy releases). Note that scikit-learn expects floating-point arguments in arrays. The dtype refers to the type of every element in a NumPy array. It remains the same throughout the array. Every single element of the array below has an integer type.
np.ones((5,2), dtype = int)
array([[1, 1],
[1, 1],
[1, 1],
[1, 1],
[1, 1]])
  1. Use np.empty to allocate memory for an array of a specific size and dtype, but no particular initialized values:
np.empty((5,2), dtype = float)
array([[ 3.14724935e-316, 3.14859499e-316],
[ 3.14858945e-316, 3.14861159e-316],
[ 3.14861435e-316, 3.14861712e-316],
[ 3.14861989e-316, 3.14862265e-316],
[ 3.14862542e-316, 3.14862819e-316]])
  1. Use np.zeros, np.ones, and np.empty to allocate memory for NumPy arrays with different initial values.
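To see the dtype of an array and convert between types, a minimal sketch follows. It assumes nothing beyond NumPy itself; the names int_ones and float_ones are illustrative.

```python
import numpy as np

int_ones = np.ones((5, 2), dtype=int)
print(int_ones.dtype.kind)          # 'i', a signed integer type

# astype returns a converted copy; the original array is unchanged
float_ones = int_ones.astype(float)
print(float_ones.dtype)             # float64
print(int_ones.dtype.kind)          # still 'i'

# np.zeros defaults to a floating-point dtype
print(np.zeros((5, 2)).dtype)       # float64
```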

Indexing

  1. Look up the values of the two-dimensional arrays with indexing:
array_1[0,0]   #Finds value in first row and first column.
1
  1. View the first row:
array_1[0,:]
array([1, 2])
  1. Then view the first column:
array_1[:,0]
array([1, 3, 5, 7, 9])
  1. View specific values along both axes. Also view the third to the fifth rows (indices 2 to 4):
array_1[2:5, :]
array([[ 5, 6],
[ 7, 8],
[ 9, 10]])
  1. View the third to the fifth rows (indices 2 to 4) only along the first column:
array_1[2:5,0]
array([5, 7, 9])

Boolean arrays

Additionally, NumPy handles indexing with Boolean logic:

  1. First produce a Boolean array:
array_1 > 5
array([[False, False],
[False, False],
[False, True],
[ True, True],
[ True, True]], dtype=bool)
  1. Place brackets around the Boolean array to filter by the Boolean array:
array_1[array_1 > 5]
array([ 6, 7, 8, 9, 10])
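Boolean arrays can also be combined before filtering. The following sketch rebuilds the same 5 x 2 array and keeps the values that are both greater than 3 and even; the name mask is illustrative.

```python
import numpy as np

array_1 = np.arange(1, 11).reshape((5, 2))

# Combine conditions element-wise with & (and) or | (or); the parentheses are required
mask = (array_1 > 3) & (array_1 % 2 == 0)
print(array_1[mask].tolist())   # [4, 6, 8, 10]

# The mask has the same shape as the array; the filtered result is one-dimensional
print(mask.shape)               # (5, 2)
print(array_1[mask].ndim)       # 1
```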

Arithmetic operations

  1. Add all the elements of the array with the sum method. Go back to array_1:
array_1
array([[ 1, 2],
[ 3, 4],
[ 5, 6],
[ 7, 8],
[ 9, 10]])
array_1.sum()
55
  1. Find all the sums by row:
array_1.sum(axis = 1)
array([ 3, 7, 11, 15, 19])
  1. Find all the sums by column:
array_1.sum(axis = 0)
array([25, 30])
  1. Find the mean of each column in a similar way. Note that the dtype of the array of averages is floating point:
array_1.mean(axis = 0)
array([ 5., 6.])

NaN values

  1. Scikit-learn will not accept np.nan values. Take array_3 as follows:
array_3 = np.array([np.nan, 0, 1, 2, np.nan])
  1. Find the NaN values with a special Boolean array created by the np.isnan function:
np.isnan(array_3)
array([ True, False, False, False, True], dtype=bool)
  1. Filter the NaN values by negating the Boolean array with the symbol ~ and placing brackets around the expression:
array_3[~np.isnan(array_3)]
array([ 0., 1., 2.])
  1. Alternatively, set the NaN values to zero:
array_3[np.isnan(array_3)] = 0
array_3
array([ 0., 0., 1., 2., 0.])

How it works...

Data, in the present and minimal sense, is about 2D tables of numbers, which NumPy handles very well. Keep this in mind in case you forget the NumPy syntax specifics. Scikit-learn accepts only 2D NumPy arrays of real numbers with no missing np.nan values.

From experience, it tends to be best to change np.nan to some value instead of throwing away data. Personally, I like to keep track of Boolean masks and keep the data shape roughly the same, as this leads to fewer coding errors and more coding flexibility.
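One way to follow this advice is sketched below. It keeps a Boolean mask of the missing positions and fills each gap with its column mean. The column-mean choice and the names X, missing_mask, and X_filled are assumptions made for illustration; the text only says to change np.nan to some value.

```python
import numpy as np

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

# Remember where the missing values were before overwriting them
missing_mask = np.isnan(X)

# Column means computed while ignoring NaN values
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its column; the shape stays (3, 2)
X_filled = np.where(missing_mask, col_means, X)

print(X_filled.tolist())          # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
print(int(missing_mask.sum()))    # 2
```

Because the mask is kept, the filled positions can still be identified later, and the array keeps its original shape.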

Loading the iris dataset

To perform machine learning with scikit-learn, we need some data to start with. We will load the iris dataset, one of the several datasets available in scikit-learn.

Getting ready

A scikit-learn program begins with several imports. Within Python, preferably in Jupyter Notebook, load the numpy, pandas, and pyplot libraries:

import numpy as np    #Load the numpy library for fast array computations
import pandas as pd #Load the pandas data-analysis library
import matplotlib.pyplot as plt #Load the pyplot visualization library

If you are within a Jupyter Notebook, type the following to see a graphical output instantly:

%matplotlib inline 

How to do it...

  1. From the scikit-learn datasets module, access the iris dataset:
from sklearn import datasets
iris = datasets.load_iris()

How it works...

Similarly, you could have imported the diabetes dataset as follows:

from sklearn import datasets  #Import datasets module from scikit-learn
diabetes = datasets.load_diabetes()

There! You've loaded diabetes using the load_diabetes() function of the datasets module. To check which datasets are available, type:

datasets.load_*?

Once you try that, you might observe that there is a dataset named datasets.load_digits. To access it, type the load_digits() function, analogous to the other loading functions:

digits = datasets.load_digits()

To view information about the dataset, type digits.DESCR.
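As a short sketch of what a loaded dataset contains, the following inspects the digits object. The attribute names used here (data, target, DESCR) are the same ones used for iris in this chapter.

```python
from sklearn import datasets

digits = datasets.load_digits()

# The returned object behaves like a dictionary with attribute access
print(sorted(digits.keys()))

# Observations and targets are NumPy arrays with matching first dimensions
print(digits.data.shape)      # (1797, 64)
print(digits.target.shape)    # (1797,)

# DESCR is a long string; print only the beginning
print(digits.DESCR[:200])
```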

Viewing the iris dataset

Now that we've loaded the dataset, let's examine what is in it. The iris dataset pertains to a supervised classification problem.

How to do it...

  1. To access the observation variables, type:
iris.data

This outputs a NumPy array:

array([[ 5.1,  3.5,  1.4,  0.2],
[ 4.9, 3. , 1.4, 0.2],
[ 4.7, 3.2, 1.3, 0.2],
#...rest of output suppressed because of length
  1. Let's examine the NumPy array:
iris.data.shape

This returns:

(150, 4)

This means that the data is 150 rows by 4 columns. Let's look at the first row:

iris.data[0]

array([ 5.1, 3.5, 1.4, 0.2])

The NumPy array for the first row has four numbers.

  1. To determine what they mean, type:
iris.feature_names
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']

The feature or column names name the data. They are strings, and in this case, they correspond to dimensions in different types of flowers. Putting it all together, we have 150 examples of flowers with four measurements per flower in centimeters. For example, the first flower has measurements of 5.1 cm for sepal length, 3.5 cm for sepal width, 1.4 cm for petal length, and 0.2 cm for petal width. Now, let's look at the output variable in a similar manner:

iris.target

This yields an array of outputs: 0, 1, and 2. There are only three outputs. Type this:

iris.target.shape

You get a shape of:

(150,)

This refers to a one-dimensional array of length 150. Let's look at what the numbers refer to:

iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

The output of the iris.target_names variable gives the English names for the numbers in the iris.target variable. The number zero corresponds to the setosa flower, number one corresponds to the versicolor flower, and number two corresponds to the virginica flower. Look at the first row of iris.target:

iris.target[0]

This produces zero, and thus the first row of observations we examined before corresponds to the setosa flower.
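The mapping from numbers to names can be applied to every row at once. The sketch below indexes iris.target_names with the integer targets and counts the flowers per class; flower_names and counts are illustrative names.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Index the names array with the integer targets to get one name per flower
flower_names = iris.target_names[iris.target]
print(flower_names[0])            # setosa

# Count how many flowers belong to each class
counts = np.bincount(iris.target)
print(counts.tolist())            # [50, 50, 50]
```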

How it works...

In machine learning, we often deal with data tables and two-dimensional arrays corresponding to examples. In the iris set, we have 150 observations of flowers of three types. With new observations, we would like to predict which type of flower those observations correspond to. The observations in this case are measurements in centimeters. It is important to look at the data pertaining to real objects. Quoting my high school physics teacher, "Do not forget the units!"

The iris dataset is intended to be for a supervised machine learning task because it has a target array, which is the variable we desire to predict from the observation variables. Additionally, it is a classification problem, as there are three numbers we can predict from the observations, one for each type of flower. In a classification problem, we are trying to distinguish between categories. The simplest case is binary classification. The iris dataset, with three flower categories, is a multi-class classification problem.

There's more...

With the same data, we can rephrase the problem in many ways, or formulate new problems. What if we want to determine relationships between the observations? We can define the petal width as the target variable. We can rephrase the problem as a regression problem and try to predict the target variable as a real number, not just three categories. Fundamentally, it comes down to what we intend to predict. Here, we desire to predict a type of flower.
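As a sketch of the regression rephrasing, the petal width column can be split off as a real-valued target while the other three measurements stay as observations. The names X_reg and y_reg are hypothetical.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Column 3 is 'petal width (cm)'; use it as a real-valued target
y_reg = iris.data[:, 3]

# The remaining three measurements become the observation variables
X_reg = iris.data[:, :3]

print(iris.feature_names[3])       # petal width (cm)
print(X_reg.shape, y_reg.shape)    # (150, 3) (150,)
print(y_reg.dtype)                 # float64
```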

Viewing the iris dataset with pandas

In this recipe, we will use the handy pandas data analysis library to view and visualize the iris dataset. It contains the notion of a dataframe, which might be familiar to you if you have used the R language's data frame.

How to do it...

You can view the iris dataset with pandas, a library built on top of NumPy:

  1. Create a dataframe with the observation variables iris.data, and column names columns, as arguments:
import pandas as pd
iris_df = pd.DataFrame(iris.data, columns = iris.feature_names)

The dataframe is more user-friendly than the NumPy array.

  1. Look at a quick histogram of the values in the dataframe for sepal length:
iris_df['sepal length (cm)'].hist(bins=30)
  1. You can also color the histogram by the target variable:
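for class_number in np.unique(iris.target):
    plt.figure(1)  #draw every class on the same figure
    #select the rows that belong to this class, then plot their sepal lengths
    iris_df['sepal length (cm)'].iloc[np.where(iris.target == class_number)[0]].hist(bins=30)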
  1. Here, iterate through the target numbers for each flower and draw a color histogram for each. Consider this line:
np.where(iris.target== class_number)[0]

It finds the NumPy index locations for each class of flower.

Observe that the histograms overlap. This encourages us to model the three histograms as three normal distributions. This is possible in a machine learning manner if we model the training data only as three normal distributions, not the whole set. Then we use the test set to test the three normal distribution models we just made up. Finally, we test the accuracy of our predictions on the test set.

How it works...

The dataframe data object is a 2D NumPy array with column names and row names. In data science, the fundamental data object looks like a 2D table, possibly because of SQL's long history. NumPy allows for 3D arrays, cubes, 4D arrays, and so on. These also come up often.
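The relationship between the dataframe and its underlying array can be checked with a short sketch. The species column added at the end is a hypothetical name used only for this illustration.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# The underlying values are still a 150 x 4 NumPy array
values = iris_df.to_numpy()
print(type(values).__name__, values.shape)    # ndarray (150, 4)

# Column names and the row index sit alongside the values
print(list(iris_df.columns))
print(iris_df.index[:3].tolist())             # [0, 1, 2]

# Add the flower name as a column and summarize the sepal length per class
iris_df['species'] = iris.target_names[iris.target]
class_means = iris_df.groupby('species')['sepal length (cm)'].mean()
print(class_means.round(3).to_dict())
```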

Plotting with NumPy and matplotlib

A simple way to make visualizations with NumPy is by using the library matplotlib. Let's make some visualizations quickly.

Getting ready

Start by importing numpy and matplotlib. You can view visualizations within an IPython Notebook using the %matplotlib inline command:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

How to do it...

  1. The main command in matplotlib, in pseudo code, is as follows:
plt.plot(x, y)  #x and y stand for two NumPy arrays of the same length
  1. Plot a straight line by placing two NumPy arrays of the same length:
plt.plot(np.arange(10), np.arange(10))
  1. Plot an exponential:
plt.plot(np.arange(10), np.exp(np.arange(10)))
  1. Place the two graphs side by side:
plt.figure()
plt.subplot(121)
plt.plot(np.arange(10), np.exp(np.arange(10)))
plt.subplot(122)
plt.scatter(np.arange(10), np.exp(np.arange(10)))

Or top to bottom:

plt.figure()
plt.subplot(211)
plt.plot(np.arange(10), np.exp(np.arange(10)))
plt.subplot(212)
plt.scatter(np.arange(10), np.exp(np.arange(10)))

The first two numbers in the subplot command refer to the grid size in the figure instantiated by plt.figure(). The grid size referred to in plt.subplot(221) is 2 x 2, the first two digits. The last digit refers to traversing the grid in reading order: left to right and then top to bottom.

  1. Plot in a 2 x 2 grid traversing in reading order from one to four:
plt.figure()
plt.subplot(221)
plt.plot(np.arange(10), np.exp(np.arange(10)))
plt.subplot(222)
plt.scatter(np.arange(10), np.exp(np.arange(10)))
plt.subplot(223)
plt.scatter(np.arange(10), np.exp(np.arange(10)))
plt.subplot(224)
plt.scatter(np.arange(10), np.exp(np.arange(10)))
  1. Finally, with real data:
from sklearn.datasets import load_iris

iris = load_iris()
data = iris.data
target = iris.target

# Resize the figure for better viewing
plt.figure(figsize=(12,5))

# First subplot
plt.subplot(121)

# Visualize the first two columns of data:
plt.scatter(data[:,0], data[:,1], c=target)

# Second subplot
plt.subplot(122)

# Visualize the last two columns of data:
plt.scatter(data[:,2], data[:,3], c=target)

The c parameter takes an array of colors; in this case, the values 0, 1, and 2 in the iris target are mapped to three colors.
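The subplot numbering used in this recipe can also be written with separate arguments, which makes the reading-order position explicit: plt.subplot(2, 2, 3) is the same as plt.subplot(223). The sketch below is a minimal illustration; the Agg backend line is an assumption added so that the script runs without a display, and it can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
fig = plt.figure(figsize=(8, 6))

# Positions 1 to 4 traverse the 2 x 2 grid left to right, then top to bottom
for k in range(1, 5):
    ax = plt.subplot(2, 2, k)
    ax.plot(x, x ** k)
    ax.set_title("position %d" % k)

print(len(fig.axes))    # 4
```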

A minimal machine learning recipe – SVM classification

Machine learning is all about making predictions. To make predictions, we will:

  • State the problem to be solved
  • Choose a model to solve the problem
  • Train the model
  • Make predictions
  • Measure how well the model performed

Getting ready

Back to the iris example, we now store the first two features (columns) of the observations as X and the target as y, a convention in the machine learning community:

X = iris.data[:, :2]  
y = iris.target

How to do it...

  1. First, we state the problem. We are trying to determine the flower-type category from a set of new observations. This is a classification task. The data available includes a target variable, which we have named y. This is a supervised classification problem.
The task of supervised learning involves predicting values of an output variable with a model that trains using input variables and an output variable.
  1. Next, we choose a model to solve the supervised classification. For now, we will use a support vector classifier. Because of its simplicity and interpretability, it is a commonly used algorithm (interpretable means easy to read into and understand).
  2. To measure the performance of prediction, we will split the dataset into training and test sets. The training set refers to data we will learn from. The test set is the data we hold out and pretend not to know as we would like to measure the performance of our learning procedure. So, import a function that will split the dataset:
from sklearn.model_selection import train_test_split
  1. Apply the function to both the observation and target data:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

The test size is 0.25 or 25% of the whole dataset. A random state of one fixes the random seed of the function so that you get the same results every time you call the function, which is important for now to reproduce the same results consistently.

  1. Now load a regularly used estimator, a support vector machine:
from sklearn.svm import SVC
  1. You have imported a support vector classifier from the svm module. Now create an instance of a linear SVC:
clf = SVC(kernel='linear',random_state=1)

The random state is fixed to reproduce the same results with the same code later.

The supervised models in scikit-learn implement a fit(X, y) method, which trains the model and returns the trained model. X is a subset of the observations, and each element of y corresponds to the target of each observation in X. Here, we fit a model on the training data:

clf.fit(X_train, y_train)

Now, the clf variable is the fitted, or trained, model.

The estimator also has a predict(X) method that takes several unlabeled observations, here X_test, and returns the predicted values, y_pred. Note that the function does not return the estimator. It returns a set of predictions:

y_pred = clf.predict(X_test)

So far, you have done all but the last step. To examine the model performance, load a scorer from the metrics module:

from sklearn.metrics import accuracy_score

With the scorer, compare the predictions with the held-out test targets:

accuracy_score(y_test,y_pred)

0.76315789473684215

How it works...

Without knowing very much about the details of support vector machines, we have implemented a predictive model. To perform machine learning, we held out one-fourth of the data and examined how the SVC performed on that data. In the end, we obtained a number that measures accuracy, or how the model performed.
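The accuracy number is simply the fraction of predictions that equal the true labels. The sketch below checks this on small made-up arrays (y_true_demo and y_pred_demo are hypothetical values, not output from the model above).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical targets and predictions, for illustration only
y_true_demo = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_pred_demo = np.array([0, 1, 1, 2, 1, 0, 2, 2])

# Accuracy is the fraction of predictions that equal the true label
manual_accuracy = np.mean(y_true_demo == y_pred_demo)

print(manual_accuracy)                             # 0.75
print(accuracy_score(y_true_demo, y_pred_demo))    # 0.75
```

Read this way, the score of about 0.763 reported above is what 29 correct predictions out of 38 held-out flowers would give.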

There's more...

To summarize, we will do all the steps with a different algorithm, logistic regression:

  1. First, import LogisticRegression:
from sklearn.linear_model import LogisticRegression
  1. Then write a program with the modeling steps:
    1. Split the data into training and testing sets.
    2. Fit the logistic regression model.
    3. Predict using the test observations.
    4. Measure the accuracy of the predictions with y_test versus y_pred:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = datasets.load_iris() #load the iris data
X = iris.data[:, :2] #keep the first two features
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

#train the model
clf = LogisticRegression(random_state = 1)
clf.fit(X_train, y_train)

#predict with Logistic Regression
y_pred = clf.predict(X_test)

#examine the model accuracy
accuracy_score(y_test,y_pred)

0.60526315789473684

This number is lower; yet we cannot make any conclusions comparing the two models, SVC and logistic regression classification. We cannot compare them, because we were not supposed to look at the test set for our model. If we made a choice between SVC and logistic regression, the choice would be part of our model as well, so the test set cannot be involved in the choice. Cross-validation, which we will look at next, is a way to choose between models.

Introducing cross-validation

We are thankful for the iris dataset, but as you might recall, it has only 150 observations. To make the most out of the set, we will employ cross-validation. Additionally, in the last section, we wanted to compare the performance of two different classifiers, support vector classifier and logistic regression. Cross-validation will help us with this comparison issue as well.

Getting ready

Suppose we wanted to choose between the support vector classifier and the logistic regression classifier. We cannot measure their performance on the unavailable test set.

What if, instead, we:

  • Forgot about the test set for now?
  • Split the training set into two parts, one to train on and one to test the training?

Split the training set into two parts using the train_test_split function used in previous sections:

from sklearn.model_selection import train_test_split
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

X_train_2 consists of 75% of the X_train data, while X_test_2 is the remaining 25%. y_train_2 is 75% of the target data, and matches the observations of X_train_2. y_test_2 is 25% of the target data present in y_train.
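The sizes of the nested splits can be verified with stand-in data that has the same number of rows as the iris dataset. All names ending in _demo, as well as X_fit and X_val, are hypothetical and chosen so that they do not overwrite the variables used in the recipe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 150 rows, like the iris dataset
X_demo = np.arange(300).reshape((150, 2))
y_demo = np.repeat([0, 1, 2], 50)

# First split: hold out 25% as the final test set
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1)

# Second split: divide the training part again into fit and validation parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train_demo, y_train_demo, test_size=0.25, random_state=1)

print(len(X_train_demo), len(X_test_demo))   # 112 38
print(len(X_fit), len(X_val))                # 84 28
```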

As you might have expected, you have to use these new splits to choose between the two models: SVC and logistic regression. Do so by writing a predictive program.

How to do it...

  1. Start with all the imports and load the iris dataset:
from sklearn import datasets

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

#load the classifying models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[:, :2] #load the first two features of the iris data
y = iris.target #load the target of the iris data

#split the whole set one time
#Note random state is 7 now
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)

#split the training set into parts
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train, y_train, test_size=0.25, random_state=7)
  1. Create an instance of an SVC classifier and fit it:
svc_clf = SVC(kernel = 'linear',random_state = 7)
svc_clf.fit(X_train_2, y_train_2)
  1. Do the same for logistic regression (both lines for logistic regression are compressed into one):
lr_clf = LogisticRegression(random_state = 7).fit(X_train_2, y_train_2)
  1. Now predict and examine the SVC and logistic regression's performance on X_test_2:
svc_pred = svc_clf.predict(X_test_2)
lr_pred = lr_clf.predict(X_test_2)

print("Accuracy of SVC:", accuracy_score(y_test_2, svc_pred))
print("Accuracy of LR:", accuracy_score(y_test_2, lr_pred))

Accuracy of SVC: 0.857142857143
Accuracy of LR: 0.714285714286
  1. The SVC performs better, but we have not yet seen the original test data. Choose SVC over logistic regression and try it on the original test set:
print("Accuracy of SVC on original Test Set: ", accuracy_score(y_test, svc_clf.predict(X_test)))

Accuracy of SVC on original Test Set: 0.684210526316

How it works...

In comparing the SVC and logistic regression classifier, you might wonder (and be a little suspicious) about a lot of scores being very different. The final test on SVC scored lower than logistic regression. To help with this situation, we can do cross-validation in scikit-learn.

Cross-validation involves splitting the training set into parts, as we did before. To match the preceding example, we split the training set into four parts, or folds. We are going to design a cross-validation iteration by taking turns with one of the four folds for testing and the other three for training. It is the same split as done before, repeated four times over the same set, thereby rotating, in a sense, the test set.

With scikit-learn, this is relatively easy to accomplish:

  1. We start with an import:
from sklearn.model_selection import cross_val_score
  1. Then we produce an accuracy score on four folds:
svc_scores = cross_val_score(svc_clf, X_train, y_train, cv=4)
svc_scores

array([ 0.82758621, 0.85714286, 0.92857143, 0.77777778])
  1. We can find the mean for average performance and standard deviation for a measure of spread of all scores relative to the mean:
print("Average SVC scores: ", svc_scores.mean())
print("Standard Deviation of SVC scores: ", svc_scores.std())

Average SVC scores: 0.847769567597
Standard Deviation of SVC scores: 0.0545962864696
  1. Similarly, with the logistic regression instance, we compute four scores:
lr_scores = cross_val_score(lr_clf, X_train, y_train, cv=4)
print("Average LR scores: ", lr_scores.mean())
print("Standard Deviation of LR scores: ", lr_scores.std())

Average LR scores: 0.748893906221
Standard Deviation of LR scores: 0.0485633168699

Now we have many scores, which confirm our selection of SVC over logistic regression. Thanks to cross-validation, we used the training set multiple times and had four small test sets within it to score our model.
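The fold rotation described above can also be written out by hand. The sketch below is an illustration under stated assumptions: it uses the first two iris features directly (named X_cv and y_cv here) rather than the X_train split from the recipe, and it uses StratifiedKFold without shuffling so that the folds line up with those of cross_val_score with cv=4.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

iris = datasets.load_iris()
X_cv, y_cv = iris.data[:, :2], iris.target

model = SVC(kernel='linear', random_state=7)
folds = StratifiedKFold(n_splits=4)

manual_scores = []
for train_idx, test_idx in folds.split(X_cv, y_cv):
    # Fit a fresh copy on three folds, then score it on the remaining fold
    fold_model = clone(model).fit(X_cv[train_idx], y_cv[train_idx])
    fold_pred = fold_model.predict(X_cv[test_idx])
    manual_scores.append(accuracy_score(y_cv[test_idx], fold_pred))

manual_scores = np.array(manual_scores)
print(manual_scores)

# The built-in helper performs the same rotation
helper_scores = cross_val_score(model, X_cv, y_cv, cv=4)
print(helper_scores)
```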

Note that our model is a bigger model that consists of:

  • Training an SVM through cross-validation
  • Training a logistic regression through cross-validation
  • Choosing between SVM and logistic regression

The choice at the end is part of the model.
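One way to make "the choice is part of the model" concrete is to wrap the selection in a function. The helper select_by_cv below is hypothetical and not part of scikit-learn, and max_iter=1000 is an assumption added to avoid convergence warnings. Only the selected model is refit and scored on the held-out test set.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

def select_by_cv(candidates, X, y, cv=4):
    """Return (best_name, mean_scores) using mean cross-validated accuracy.

    `candidates` maps a label to an unfitted estimator (hypothetical helper).
    """
    mean_scores = {name: cross_val_score(est, X, y, cv=cv).mean()
                   for name, est in candidates.items()}
    best_name = max(mean_scores, key=mean_scores.get)
    return best_name, mean_scores

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7)

candidates = {
    'svc': SVC(kernel='linear', random_state=7),
    'lr': LogisticRegression(random_state=7, max_iter=1000),
}
best_name, mean_scores = select_by_cv(candidates, X_train, y_train)
print(best_name, mean_scores)

# The test set is touched once, after the choice has been made
final_model = candidates[best_name].fit(X_train, y_train)
final_score = final_model.score(X_test, y_test)
print(final_score)
```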

There's more...

Despite our hard work and the elegance of the scikit-learn syntax, the score on the test set at the very end remains suspicious. The reason for this is that the test and train split are not necessarily balanced; the train and test sets do not necessarily have similar proportions of all the classes.

This is easily remedied by using a stratified test-train split:

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

By passing the target as the stratify argument, the target classes are balanced across the training and test sets. This brings the SVC scores closer together.

svc_scores = cross_val_score(svc_clf, X_train, y_train, cv=4)
print("Average SVC scores: ", svc_scores.mean())
print("Standard Deviation of SVC scores: ", svc_scores.std())
#cross_val_score fits copies of svc_clf, so svc_clf itself is still the model fitted earlier
print("Score on Final Test Set:", accuracy_score(y_test, svc_clf.predict(X_test)))

Average SVC scores: 0.831547619048
Standard Deviation of SVC scores: 0.0792488953372
Score on Final Test Set: 0.789473684211

Additionally, note that in the preceding example, the cross-validation procedure produces stratified folds by default:

from sklearn.model_selection import cross_val_score
svc_scores = cross_val_score(svc_clf, X_train, y_train, cv = 4)

The preceding code is equivalent to:

from sklearn.model_selection import cross_val_score, StratifiedKFold
skf = StratifiedKFold(n_splits = 4)
svc_scores = cross_val_score(svc_clf, X_train, y_train, cv = skf)
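To check that a stratified train-test split preserves the class proportions, the class counts on each side can be compared. In the sketch below, random_state=0 is an assumption added so that the check is repeatable, and the names y_train_s and y_test_s are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# random_state is an assumption added here so the check is repeatable
_, _, y_train_s, y_test_s = train_test_split(X, y, stratify=y, random_state=0)

train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)

# With 50 flowers per class, the counts per class differ by at most one
print(train_counts.tolist(), test_counts.tolist())
```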

Putting it all together

Now, we are going to perform the same procedure as before, except that we will reset, regroup, and try a new algorithm: K-Nearest Neighbors (KNN).

How to do it...

  1. Start by importing the model from sklearn, followed by a balanced split:
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state = 0)
The random_state parameter fixes the random seed in the function train_test_split. In the preceding example, the random_state is set to zero; it can be set to any integer.
  1. Construct two different KNN models by varying the n_neighbors parameter. Observe that the number of folds is now 10. Tenfold cross-validation is common in the machine learning community, particularly in data science competitions:
from sklearn.model_selection import cross_val_score
knn_3_clf = KNeighborsClassifier(n_neighbors = 3)
knn_5_clf = KNeighborsClassifier(n_neighbors = 5)

knn_3_scores = cross_val_score(knn_3_clf, X_train, y_train, cv=10)
knn_5_scores = cross_val_score(knn_5_clf, X_train, y_train, cv=10)
  3. Compute and print out the scores for selection:
print("knn_3 mean scores: ", knn_3_scores.mean(), "knn_3 std: ", knn_3_scores.std())
print("knn_5 mean scores: ", knn_5_scores.mean(), " knn_5 std: ", knn_5_scores.std())

knn_3 mean scores: 0.798333333333 knn_3 std: 0.0908142181722
knn_5 mean scores: 0.806666666667 knn_5 std: 0.0559320575496

Both nearest neighbor types score similarly, yet the KNN with parameter n_neighbors = 5 is a bit more stable. This is an example of hyperparameter optimization, which we will examine closely throughout the book.

There's more...

You could have just as easily run a simple loop to score several values of n_neighbors more quickly:

all_scores = []
for n_neighbors in range(3, 9, 1):
    knn_clf = KNeighborsClassifier(n_neighbors = n_neighbors)
    all_scores.append((n_neighbors, cross_val_score(knn_clf, X_train, y_train, cv=10).mean()))
sorted(all_scores, key = lambda x:x[1], reverse = True)

Its output suggests that n_neighbors = 4 is a good choice:

[(4, 0.85111111111111115),
(7, 0.82611111111111113),
(6, 0.82333333333333347),
(5, 0.80666666666666664),
(3, 0.79833333333333334),
(8, 0.79833333333333334)]
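
The same kind of search can also be sketched with scikit-learn's GridSearchCV, which runs the cross-validated loop internally. This is an alternative to the loop above, not the recipe's own method, and it assumes the full iris data, so the scores it reports need not match the output shown:

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target  # assumption: full iris data; the recipe's X may differ
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Try n_neighbors = 3..8 with tenfold cross-validation, as in the loop
param_grid = {'n_neighbors': list(range(3, 9))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
grid.fit(X_train, y_train)

print("Best n_neighbors:", grid.best_params_['n_neighbors'])
print("Best mean CV accuracy:", grid.best_score_)
```

After fitting, grid.cv_results_ holds the per-candidate mean scores, which play the role of the all_scores list.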

Machine learning overview – classification versus regression

In this recipe, we will examine how regression can be viewed as being very similar to classification. This is done by reconsidering the categorical labels of classification as real numbers. In this section, we will also look at several aspects of machine learning from a very broad perspective, including the purpose of scikit-learn. scikit-learn allows us to find models that work well incredibly quickly. We do not have to work out all the details of a model, or optimize it, until we have found one that works well. Consequently, your company saves precious development time and computational resources, because scikit-learn lets us develop models relatively quickly.

The purpose of scikit-learn

As we have seen before, scikit-learn allowed us to find a working model fairly quickly. We tried SVC, logistic regression, and a few KNN classifiers. Through cross-validation, we selected models that performed better than others. In industry, after trying SVMs and logistic regression, we might focus on SVMs and optimize them further. Thanks to scikit-learn, we saved a lot of time and resources, including mental energy. After optimizing the SVM at work on a realistic dataset, we might re-implement it for speed in Java or C and gather more data.

Supervised versus unsupervised

Classification and regression are supervised, as we know the target variables for the observations. Clustering, which creates regions in space for each category without being given any labels, is unsupervised learning.

Getting ready

In classification, the target variable is one of several categories, and there must be more than one instance of every category. In regression, there may be only one instance of each target value, as the only requirement is that the target is a real number.

In the case of logistic regression, we saw previously that the algorithm first performs a regression and estimates a real number for the target. Then the target class is estimated by using thresholds. In scikit-learn, there are predict_proba methods that yield probabilistic estimates, which relate regression-like real number estimates with classification classes in the style of logistic regression.

Any regression can be turned into classification by using thresholds. A binary classification can be viewed as a regression problem by using a regressor. The target variables produced will be real numbers, not the original class variables.
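
As a rough illustration of both points, the sketch below thresholds the probability from predict_proba at 0.5 and compares it with predict, then thresholds the output of a plain LinearRegression in the same way. The two-class iris subset, the 0.5 cutoff, and the variable names are assumptions chosen for this illustration:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[iris.target < 2]  # assumption: keep only targets 0 and 1
y = iris.target[iris.target < 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=7)

# Logistic regression: a real-valued probability, then a threshold
log_clf = LogisticRegression().fit(X_train, y_train)
proba_class_1 = log_clf.predict_proba(X_test)[:, 1]
thresholded = (proba_class_1 > 0.5).astype(int)
print("Matches predict():", np.array_equal(thresholded, log_clf.predict(X_test)))

# A plain regressor turned into a classifier with the same kind of threshold
lin_reg = LinearRegression().fit(X_train, y_train)
lin_as_class = (lin_reg.predict(X_test) > 0.5).astype(int)
print("Thresholded regression accuracy:", (lin_as_class == y_test).mean())
```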

How to do it...

Quick SVC – a classifier and regressor

  1. Load iris from the datasets module:
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
  2. For simplicity, consider only targets 0 and 1, corresponding to Setosa and Versicolor. Use the Boolean array iris.target < 2 to filter out target 2. Place it within brackets to use it as a filter in defining the observation set X and the target set y:
X = iris.data[iris.target < 2]
y = iris.target[iris.target < 2]
  3. Now import train_test_split and apply it:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state= 7)
  4. Prepare and run an SVC by importing it and scoring it with cross-validation:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

svc_clf = SVC(kernel = 'linear').fit(X_train, y_train)
svc_scores = cross_val_score(svc_clf, X_train, y_train, cv=4)
  5. As done in previous sections, view the average of the scores:
svc_scores.mean()

0.94795321637426899
  6. Perform the same with support vector regression by importing SVR from sklearn.svm, the same module that contains SVC:
from sklearn.svm import SVR
  7. Then write the necessary syntax to fit the model. It is almost identical to the syntax for SVC, just replacing the C in SVC with an R:
svr_clf = SVR(kernel = 'linear').fit(X_train, y_train)

Making a scorer

To make a scorer, you need:

  • A scoring function that compares y_test, the ground truth, with y_pred, the predictions
  • To determine whether a high score is good or bad

Before passing the SVR regressor to the cross-validation, make a scorer by supplying these two elements:

  1. In practice, begin by importing the make_scorer function:
from sklearn.metrics import make_scorer
  2. Use this sample scoring function:
# Only works for this iris example with targets 0 and 1
def for_scorer(y_test, orig_y_pred):
    y_pred = np.rint(orig_y_pred).astype(int)  # rounds prediction to the nearest integer
    return accuracy_score(y_test, y_pred)

The np.rint function rounds the prediction to the nearest integer, hopefully one of the targets, 0 or 1. The astype method converts the prediction to an integer type, as the original target is an integer and consistent types are preferred. After the rounding, the scoring function uses the familiar accuracy_score function.

  3. Now, determine whether a higher score is better. Higher accuracy is better, so for this situation, a higher score is better. In scikit-learn code:
svr_to_class_scorer = make_scorer(for_scorer, greater_is_better=True)
  4. Finally, run the cross-validation with a new parameter, the scoring parameter:
svr_scores = cross_val_score(svr_clf, X_train, y_train, cv=4, scoring = svr_to_class_scorer)
  5. Find the mean:
svr_scores.mean()

0.94663742690058483

The accuracy scores are similar for the SVR regressor-based classifier and the traditional SVC classifier.
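
The rounding step inside for_scorer deserves a closer look. As a small sketch with made-up regression outputs (the raw values below are hypothetical, not produced by the SVR above), np.rint can land outside the targets 0 and 1 when a prediction strays far enough. Clipping with np.clip is one possible guard, shown here as an assumption rather than something the recipe itself does:

```python
import numpy as np

# Hypothetical raw regressor outputs around the targets 0 and 1
raw = np.array([-0.2, 0.3, 0.8, 1.2, 1.7])

rounded = np.rint(raw).astype(int)   # nearest integer; may fall outside {0, 1}
clipped = np.clip(rounded, 0, 1)     # force the result back into {0, 1}

print(rounded)
print(clipped)
```

A rounded value outside the target set simply counts as a wrong prediction in accuracy_score, so the scorer still works without clipping.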

How it works...

You might ask, why did we take class 2 out of the target set?

The reason is that, to use a regressor, our intent has to be to predict a real number. The categories have to share a property of real numbers: they must be ordered (informally, if we have three ordered categories x, y, z with x < y and y < z, then x < z). By eliminating the third category, the remaining flowers (Setosa and Versicolor) became ordered by a property we invented: Setosaness or Versicolorness.

The next time you encounter categories, you can consider whether they can be ordered. For example, if the dataset consists of shoe sizes, they can be ordered and a regressor can be applied, even though no one has a shoe size of 12.125.

There's more...

Linear versus nonlinear

Linear algorithms involve lines or hyperplanes. Hyperplanes are flat surfaces in any n-dimensional space. They tend to be easy to understand and explain, as they involve ratios (with an offset). Some functions that consistently and monotonically increase or decrease can be mapped to a linear function with a transformation. For example, exponential growth can be mapped to a line with the log transformation.
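
The log transformation mentioned above can be sketched in a few lines. The growth parameters a and b are hypothetical values chosen for illustration; taking the log of y = a * exp(b * x) gives log(y) = log(a) + b * x, which is a straight line that np.polyfit can fit:

```python
import numpy as np

x = np.linspace(0, 5, 50)
a, b = 2.0, 0.8            # hypothetical initial value and growth rate
y = a * np.exp(b * x)      # exponential growth

# Fit a degree-1 polynomial (a line) to log(y) against x
slope, intercept = np.polyfit(x, np.log(y), 1)

print("Recovered growth rate b:", slope)
print("Recovered initial value a:", np.exp(intercept))
```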

Nonlinear algorithms tend to be tougher to explain to colleagues and investors, yet ensembles of decision trees, which are nonlinear, tend to perform very well. KNN, which we examined earlier, is nonlinear. In some cases, functions not increasing or decreasing in a familiar manner are acceptable for the sake of accuracy.

Try a simple SVC with a polynomial kernel, as follows:

from sklearn.svm import SVC   #Usual import of SVC
svc_poly_clf = SVC(kernel = 'poly', degree= 3).fit(X_train, y_train) #Polynomial Kernel of Degree 3

The polynomial kernel of degree 3 looks like a cubic curve in two dimensions. It leads to a slightly better fit, but note that it can be harder to explain to others than a linear kernel with consistent behavior throughout all of the Euclidean space:

svc_poly_scores = cross_val_score(svc_poly_clf, X_train, y_train, cv=4)
svc_poly_scores.mean()

0.95906432748538006

Black box versus not

For the sake of efficiency, we did not examine the classification algorithms used very closely. When we compared SVC and logistic regression, we chose SVMs. At that point, both algorithms were black boxes, as we did not know any internal details. Once we decided to focus on SVMs, we could proceed to compute coefficients of the separating hyperplanes involved, optimize the hyperparameters of the SVM, use the SVM for big data, and do other processes. The SVMs have earned our time investment because of their superior performance.
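
As a first look inside the box, a linear-kernel SVC exposes the coefficients of its separating hyperplane through the coef_ and intercept_ attributes. The sketch below assumes the two-class iris subset used in this recipe and checks that the decision function is the linear expression built from those coefficients:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[iris.target < 2]  # assumption: targets 0 and 1 only
y = iris.target[iris.target < 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=7)

svc_clf = SVC(kernel='linear').fit(X_train, y_train)
print("Hyperplane coefficients:", svc_clf.coef_)
print("Intercept:", svc_clf.intercept_)

# The decision function of a linear SVC is w . x + b
manual = X_test @ svc_clf.coef_.ravel() + svc_clf.intercept_[0]
print("Matches decision_function:", np.allclose(manual, svc_clf.decision_function(X_test)))
```

The coef_ attribute is only available for the linear kernel, so this inspection does not carry over directly to the polynomial kernel.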

Interpretability

Some machine learning algorithms are easier to understand than others. These are usually easier to explain to others as well. For example, linear regression is well known and easy to understand and explain to potential investors of your company. SVMs are more difficult to entirely understand.

My general advice: if SVMs are highly effective for a particular dataset, try to deepen your own understanding of SVMs in the particular problem context. Also, consider combining algorithms somehow, using linear regression as an input to SVMs, for example. This way, you have the best of both worlds.

This is really context-specific, however. Linear SVMs are relatively simple to visualize and understand. Merging linear regression with SVM could complicate things. You can start by comparing them side by side.

However, if you cannot understand every detail of the math and practice of SVMs, be kind to yourself, as machine learning focuses more on prediction performance than on traditional statistics.

A pipeline

In programming, a pipeline is a set of procedures connected in series, one after the other, where the output of one process is the input to the next:

You can replace any procedure in the process with a different one, perhaps better in some way, without compromising the whole system. For the model in the middle step, you can use an SVC or logistic regression:

One can also keep track of the classifier itself and build a flow diagram from the classifier. Here is a pipeline keeping track of the SVC classifier:

In the upcoming chapters, we will see how scikit-learn uses the intuitive notion of a pipeline. So far, we have used a simple one: train, predict, test.
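
As a preview, here is a minimal sketch of that idea using scikit-learn's Pipeline class. The step names 'scale' and 'model', the StandardScaler step, and the two-class iris subset are assumptions made for this illustration. It shows the middle step being swapped from an SVC to logistic regression without touching the rest:

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[iris.target < 2]  # assumption: targets 0 and 1 only
y = iris.target[iris.target < 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=7)

# Two steps in series: scale the features, then fit a model
pipe = Pipeline([('scale', StandardScaler()), ('model', SVC(kernel='linear'))])
svc_pipe_scores = cross_val_score(pipe, X_train, y_train, cv=4)

# Replace only the model step; the scaling step stays as it is
pipe.set_params(model=LogisticRegression())
logit_pipe_scores = cross_val_score(pipe, X_train, y_train, cv=4)

print("SVC pipeline mean score:", svc_pipe_scores.mean())
print("Logistic regression pipeline mean score:", logit_pipe_scores.mean())
```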

Key benefits

  • Handle a variety of machine learning tasks effortlessly by leveraging the power of scikit-learn
  • Perform supervised and unsupervised learning with ease, and evaluate the performance of your model
  • Practical, easy to understand recipes aimed at helping you choose the right machine learning algorithm

Description

Python is quickly becoming the go-to language for analysts and data scientists due to its simplicity and flexibility, and within the Python data space, scikit-learn is the unequivocal choice for machine learning. This book includes walkthroughs and solutions to the common as well as the not-so-common problems in machine learning, and how scikit-learn can be leveraged to perform various machine learning tasks effectively. The second edition begins by taking you through recipes on evaluating the statistical properties of data and generating synthetic data for machine learning modelling. As you progress through the chapters, you will come across recipes that will teach you to implement techniques like data pre-processing, linear regression, logistic regression, K-NN, Naïve Bayes, classification, decision trees, ensembles, and much more. Furthermore, you’ll learn to optimize your models with multi-class classification, cross-validation, and model evaluation, and dive deeper into implementing deep learning with scikit-learn. Along with covering the enhanced features on model selection, the API, and new features like classifiers, regressors, and estimators, the book also contains recipes on evaluating and fine-tuning the performance of your model. By the end of this book, you will have explored a plethora of features offered by scikit-learn for Python to solve any machine learning problem you come across.

Who is this book for?

Data analysts who are already familiar with Python but not so much with scikit-learn, and who want quick solutions to common machine learning problems, will find this book very useful. If you are a Python programmer who wants to take a dive into the world of machine learning in a practical manner, this book will help you too.

What you will learn

  • Build predictive models in minutes by using scikit-learn
  • Understand the differences and relationships between Classification and Regression, two types of Supervised Learning.
  • Use distance metrics to predict in Clustering, a type of Unsupervised Learning
  • Find points with similar characteristics with Nearest Neighbors.
  • Use automation and cross-validation to find a best model and focus on it for a data product
  • Choose the best algorithm among many, or use several together in an ensemble.
  • Create your own estimator with the simple syntax of sklearn
  • Explore the feed-forward neural networks available in scikit-learn

Product Details

Publication date : Nov 16, 2017
Length: 374 pages
Edition : 2nd
Language : English
ISBN-13 : 9781787286382

Table of Contents

12 Chapters
High-Performance Machine Learning – NumPy
Pre-Model Workflow and Pre-Processing
Dimensionality Reduction
Linear Models with scikit-learn
Linear Models – Logistic Regression
Building Models with Distance Metrics
Cross-Validation and Post-Model Workflow
Support Vector Machines
Tree Algorithms and Ensembles
Text and Multiclass Classification with scikit-learn
Neural Networks
Create a Simple Estimator

Customer reviews

Rating distribution
3.7 out of 5 (3 Ratings)
5 star 66.7%
4 star 0%
3 star 0%
2 star 0%
1 star 33.3%
Jose Luis Ramirez Nov 28, 2017
5 stars
I've been working with python in my AI projects and I came up upon this great book of recipes that I can use quickly and practically in every stage of my developments. It is ease to use right away and it has reference to the enough amount of theory so you don't have to go to search around for extra info. The book introduces neural networks in a simple way and it has robust OOP for more complex AI projects.
Amazon Verified review
Rex Jones May 03, 2018
5 stars
Excellent reference for basic modeling in Python using scikit. Julian Avila has created a great reference that I use it in my class. It's easy to incorporate reading assignments from this book.
Amazon Verified review
Miss S Betts Apr 22, 2020
1 star
Purchase book.Open in cloud reader and start flicking through - and the reader starts displaying the word in a single column down the centre of the screen. Can't find settings to stop it.
Amazon Verified review
