First steps with scikit-learn – training a perceptron
In Chapter 2, Training Simple Machine Learning Algorithms for Classification, you learned about two related learning algorithms for classification, the perceptron rule and Adaline, which we implemented in Python and NumPy by ourselves. Now we will take a look at the scikit-learn API, which, as mentioned, combines a user-friendly and consistent interface with a highly optimized implementation of several classification algorithms. The scikit-learn library offers not only a large variety of learning algorithms, but also many convenient functions to preprocess data and to fine-tune and evaluate our models. We will discuss this in more detail, together with the underlying concepts, in Chapter 4, Building Good Training Datasets – Data Preprocessing, and Chapter 5, Compressing Data via Dimensionality Reduction.
To get started with the scikit-learn library, we will train a perceptron model similar to the one that we implemented in Chapter 2. For simplicity, we will use the already familiar Iris dataset throughout the following sections. Conveniently, the Iris dataset is already available via scikit-learn, since it is a simple yet popular dataset that is frequently used for testing and experimenting with algorithms. Similar to the previous chapter, we will only use two features from the Iris dataset for visualization purposes.
We will assign the petal length and petal width of the 150 flower examples to the feature matrix, X, and the corresponding class labels of the flower species to the vector array, y:
>>> from sklearn import datasets
>>> import numpy as np
>>> iris = datasets.load_iris()
>>> X = iris.data[:, [2, 3]]
>>> y = iris.target
>>> print('Class labels:', np.unique(y))
Class labels: [0 1 2]
The np.unique(y) function returned the three unique class labels stored in iris.target. As we can see, the Iris flower class names, Iris-setosa, Iris-versicolor, and Iris-virginica, are already stored as integers (here: 0, 1, 2). Although many scikit-learn functions and class methods also work with class labels in string format, using integer labels is recommended: it avoids technical glitches and improves computational performance due to a smaller memory footprint. Furthermore, encoding class labels as integers is a common convention among most machine learning libraries.
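To see which names sit behind the integer labels and the two selected columns, we can inspect the metadata that load_iris returns. The following is a small sketch; the variable names feature_names and label_to_name are our own, and note that scikit-learn stores the species names without the Iris- prefix:

```python
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()

# Names of the two columns selected via iris.data[:, [2, 3]]
feature_names = [iris.feature_names[i] for i in (2, 3)]

# Map each integer class label to the species name stored in the dataset
label_to_name = {
    int(label): str(iris.target_names[label])
    for label in np.unique(iris.target)
}

print('Selected features:', feature_names)
print('Label mapping:', label_to_name)
```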
To evaluate how well a trained model performs on unseen data, we will further split the dataset into separate training and test datasets. In Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning, we will discuss the best practices around model evaluation in more detail. Using the train_test_split function from scikit-learn’s model_selection module, we randomly split the X and y arrays into 30 percent test data (45 examples) and 70 percent training data (105 examples):
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.3, random_state=1, stratify=y
... )
Note that the train_test_split function already shuffles the dataset internally before splitting; otherwise, all examples from class 0 and class 1 would have ended up in the training dataset, and the test dataset would consist of 45 examples from class 2. Via the random_state parameter, we provided a fixed random seed (random_state=1) for the internal pseudo-random number generator that is used for shuffling the dataset prior to splitting. Using such a fixed random_state ensures that our results are reproducible.
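To make the effect of shuffling concrete, the following sketch disables it via the shuffle=False argument of train_test_split. The names y_train_ns and y_test_ns are our own, chosen to avoid overwriting the split used in the rest of this section. Because the Iris examples are stored in class order, the last 45 examples all belong to class 2:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

# Split without shuffling: the first 105 rows become the training data
# and the last 45 rows become the test data
_, _, y_train_ns, y_test_ns = train_test_split(
    X, y, test_size=0.3, shuffle=False
)

train_counts = np.bincount(y_train_ns, minlength=3)
test_counts = np.bincount(y_test_ns, minlength=3)
print('Label counts in y_train_ns:', train_counts)
print('Label counts in y_test_ns:', test_counts)
```

A model evaluated on such a split would be tested on only one class, which is why shuffling is the default.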
Lastly, we took advantage of the built-in support for stratification via stratify=y. In this context, stratification means that the train_test_split function returns training and test subsets that have the same proportions of class labels as the input dataset. We can use NumPy’s bincount function, which counts the number of occurrences of each value in an array, to verify that this is indeed the case:
>>> print('Labels counts in y:', np.bincount(y))
Labels counts in y: [50 50 50]
>>> print('Labels counts in y_train:', np.bincount(y_train))
Labels counts in y_train: [35 35 35]
>>> print('Labels counts in y_test:', np.bincount(y_test))
Labels counts in y_test: [15 15 15]
Many machine learning and optimization algorithms also require feature scaling for optimal performance, as we saw in the gradient descent example in Chapter 2. Here, we will standardize the features using the StandardScaler class from scikit-learn’s preprocessing module:
>>> from sklearn.preprocessing import StandardScaler
>>> sc = StandardScaler()
>>> sc.fit(X_train)
>>> X_train_std = sc.transform(X_train)
>>> X_test_std = sc.transform(X_test)
Using the preceding code, we loaded the StandardScaler class from the preprocessing module and initialized a new StandardScaler object that we assigned to the sc variable. Using the fit method, StandardScaler estimated the parameters, μ (sample mean) and σ (standard deviation), for each feature dimension from the training data. By calling the transform method, we then standardized the training data using those estimated parameters, μ and σ. Note that we used the same scaling parameters to standardize the test dataset so that the values in the training and test datasets are comparable with one another.
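Standardization replaces each feature value x with (x - mean) / standard deviation, where both statistics are computed per feature from the training data. As a sanity check, the following sketch reproduces the transformation with plain NumPy and compares it with the fitted scaler's mean_ and scale_ attributes. The names mu, sigma, X_train_manual, and X_test_manual are our own:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

sc = StandardScaler()
sc.fit(X_train)

# Per-feature statistics estimated from the training data only;
# np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses as well
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# The same mu and sigma are applied to both subsets
X_train_manual = (X_train - mu) / sigma
X_test_manual = (X_test - mu) / sigma

print(np.allclose(sc.mean_, mu), np.allclose(sc.scale_, sigma))
print(np.allclose(sc.transform(X_train), X_train_manual))
print(np.allclose(sc.transform(X_test), X_test_manual))
```

After this transformation, the training features have zero mean and unit variance, whereas the test features are only approximately centered because they reuse the training statistics.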
Having standardized the training data, we can now train a perceptron model. Most algorithms in scikit-learn already support multiclass classification by default via the one-versus-rest (OvR) method, which allows us to feed the three flower classes to the perceptron all at once. The code is as follows:
>>> from sklearn.linear_model import Perceptron
>>> ppn = Perceptron(eta0=0.1, random_state=1)
>>> ppn.fit(X_train_std, y_train)
The scikit-learn interface will remind you of our perceptron implementation in Chapter 2. After loading the Perceptron class from the linear_model module, we initialized a new Perceptron object and trained the model via the fit method. Here, the model parameter, eta0, is equivalent to the learning rate, eta, that we used in our own perceptron implementation.
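To get a feel for what the one-versus-rest scheme means for the fitted model, we can look at the learned parameters. The sketch below rebuilds the same pipeline and shows that the model holds one weight vector and one bias unit per class, and that the predicted label is the class whose linear score is largest. The names scores and manual_pred are our own:

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
sc = StandardScaler().fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

ppn = Perceptron(eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)

# One row of weights and one bias per class (3 classes, 2 features)
print('Weights shape:', ppn.coef_.shape)
print('Bias shape:', ppn.intercept_.shape)

# One score per class and example; the largest score determines the label
scores = ppn.decision_function(X_test_std)
manual_pred = ppn.classes_[np.argmax(scores, axis=1)]
print(np.array_equal(manual_pred, ppn.predict(X_test_std)))
```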
As you will remember from Chapter 2, finding an appropriate learning rate requires some experimentation. If the learning rate is too large, the algorithm will overshoot the global loss minimum. If the learning rate is too small, the algorithm will require more epochs until convergence, which can make the learning slow, especially for large datasets. Also, we used the random_state parameter to ensure the reproducibility of the shuffling of the training dataset after each epoch.
Having trained a model in scikit-learn, we can make predictions via the predict method, just like in our own perceptron implementation in Chapter 2. The code is as follows:
>>> y_pred = ppn.predict(X_test_std)
>>> print('Misclassified examples: %d' % (y_test != y_pred).sum())
Misclassified examples: 1
Executing the code, we can see that the perceptron misclassifies 1 out of the 45 flower examples. Thus, the misclassification error on the test dataset is approximately 0.022, or 2.2 percent (1/45 ≈ 0.022).
Classification error versus accuracy
Instead of the misclassification error, many machine learning practitioners report the classification accuracy of a model, which is simply calculated as follows:
1 – error = 0.978, or 97.8 percent
Whether we use the classification error or accuracy is merely a matter of preference.
Note that scikit-learn also implements a large variety of different performance metrics that are available via the metrics module. For example, we can calculate the classification accuracy of the perceptron on the test dataset as follows:
>>> from sklearn.metrics import accuracy_score
>>> print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))
Accuracy: 0.978
Here, y_test contains the true class labels and y_pred contains the class labels that we predicted previously. Alternatively, each classifier in scikit-learn has a score method, which computes a classifier’s prediction accuracy by combining the predict call with accuracy_score, as shown here:
>>> print('Accuracy: %.3f' % ppn.score(X_test_std, y_test))
Accuracy: 0.978
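The relationship between these quantities can be checked directly. The sketch below recomputes the misclassification error by hand and confirms that accuracy_score and the score method both return 1 - error. The names n_wrong, error, and acc are our own, and the exact number of misclassified examples may differ from the one reported above depending on the installed scikit-learn version:

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
sc = StandardScaler().fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

ppn = Perceptron(eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)
y_pred = ppn.predict(X_test_std)

# Misclassification error: fraction of test examples predicted incorrectly
n_wrong = int((y_test != y_pred).sum())
error = n_wrong / len(y_test)

# Accuracy is the complement of the error
acc = accuracy_score(y_test, y_pred)
print(f'Error: {error:.3f}')
print(f'Accuracy: {acc:.3f}')
print(f'1 - error: {1 - error:.3f}')
print(f'score(): {ppn.score(X_test_std, y_test):.3f}')
```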
Overfitting
Note that we will evaluate the performance of our models based on the test dataset in this chapter. In Chapter 6, you will learn about useful techniques, including graphical analysis, such as learning curves, to detect and prevent overfitting. Overfitting, which we will return to later in this chapter, means that the model captures the patterns in the training data well but fails to generalize well to unseen data.
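A rough first check, well short of the techniques covered in Chapter 6, is to compare the accuracy on the training data with the accuracy on the test data. The sketch below does this for the perceptron from this section. The names train_acc and test_acc are our own, and a small gap on a single split should be read as a hint rather than as evidence that the model generalizes well:

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
sc = StandardScaler().fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

ppn = Perceptron(eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)

# A training accuracy far above the test accuracy would suggest overfitting
train_acc = ppn.score(X_train_std, y_train)
test_acc = ppn.score(X_test_std, y_test)
print(f'Training accuracy: {train_acc:.3f}')
print(f'Test accuracy: {test_acc:.3f}')
```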
Finally, we can use our plot_decision_regions function from Chapter 2 to plot the decision regions of our newly trained perceptron model and visualize how well it separates the different flower examples. However, let’s add a small modification to highlight the data instances from the test dataset via small circles:
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np

def plot_decision_regions(X, y, classifier, test_idx=None,
                          resolution=0.02):
    # setup marker generator and color map
    markers = ('o', 's', '^', 'v', '<')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    lab = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    lab = lab.reshape(xx1.shape)
    plt.contourf(xx1, xx2, lab, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    # plot class examples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.8,
                    c=colors[idx],
                    marker=markers[idx],
                    label=f'Class {cl}',
                    edgecolor='black')
    # highlight test examples
    if test_idx:
        # circle the examples that belong to the test dataset
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1],
                    c='none', edgecolor='black', alpha=1.0,
                    linewidth=1, marker='o',
                    s=100, label='Test set')
With the slight modification that we made to the plot_decision_regions function, we can now specify the indices of the examples that we want to mark on the resulting plots. The code is as follows:
>>> X_combined_std = np.vstack((X_train_std, X_test_std))
>>> y_combined = np.hstack((y_train, y_test))
>>> plot_decision_regions(X=X_combined_std,
... y=y_combined,
... classifier=ppn,
... test_idx=range(105, 150))
>>> plt.xlabel('Petal length [standardized]')
>>> plt.ylabel('Petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.tight_layout()
>>> plt.show()
As we can see in the resulting plot, the three flower classes can’t be perfectly separated by a linear decision boundary:
Figure 3.1: Decision boundaries of a multi-class perceptron model fitted to the Iris dataset
However, remember from our discussion in Chapter 2 that the perceptron algorithm never converges on datasets that aren’t perfectly linearly separable, which is why the use of the perceptron algorithm is typically not recommended in practice. In the following sections, we will look at more powerful linear classifiers that converge to a loss minimum even if the classes are not perfectly linearly separable.
Additional perceptron settings
The Perceptron, as well as other scikit-learn functions and classes, often has additional parameters that we omit for clarity. You can read more about those parameters using the help function in Python (for instance, help(Perceptron)) or by going through the excellent scikit-learn online documentation at http://scikit-learn.org/stable/.
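Another quick way to see which settings a model carries, including those left at their defaults, is the get_params method that scikit-learn estimators provide. The following sketch prints a few of them; the selection of parameter names is our own, and the default values may vary between scikit-learn versions:

```python
from sklearn.linear_model import Perceptron

ppn = Perceptron(eta0=0.1, random_state=1)

# get_params returns a dictionary of all constructor parameters,
# both the ones we set explicitly and the ones left at their defaults
params = ppn.get_params()
for name in ('eta0', 'random_state', 'max_iter', 'tol'):
    print(f'{name}: {params[name]}')
```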