Chapter 12. Training Artificial Neural Networks for Image Recognition
As you may know, deep learning is getting a lot of press and is without any doubt the hottest topic in the machine learning field. Deep learning can be understood as a set of algorithms that were developed to train artificial neural networks with many layers most efficiently. In this chapter, you will learn the basic concepts of artificial neural networks so that you will be well equipped to further explore the most exciting areas of research in the machine learning field, as well as the advanced Python-based deep learning libraries that are currently being developed.
The topics that we will cover are as follows:
- Getting a conceptual understanding of multi-layer neural networks
- Training neural networks for image classification
- Implementing the powerful backpropagation algorithm
- Debugging neural network implementations
Modeling complex functions with artificial neural networks
At the beginning of this book, we started our journey through machine learning algorithms with artificial neurons in Chapter 2, Training Machine Learning Algorithms for Classification. Artificial neurons represent the building blocks of the multi-layer artificial neural networks that we are going to discuss in this chapter. The basic concept behind artificial neural networks was built upon hypotheses and models of how the human brain works to solve complex problem tasks. Although artificial neural networks have gained a lot of popularity in recent years, early studies of neural networks go back to the 1940s when Warren McCulloch and Walter Pitts first described how neurons could work. However, in the decades that followed the first implementation of the McCulloch-Pitts neuron model, Rosenblatt's perceptron in the 1950s, many researchers and machine learning practitioners slowly began to lose interest in neural networks since no one had a good solution for training a neural network with multiple layers. Eventually, interest in neural networks was rekindled in 1986 when D.E. Rumelhart, G.E. Hinton, and R.J. Williams were involved in the (re)discovery and popularization of the backpropagation algorithm to train neural networks more efficiently, which we will discuss in more detail later in this chapter (Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986). Learning Representations by Back-propagating Errors. Nature 323 (6088): 533–536).
During the previous decade, many more major breakthroughs resulted in what we now call deep learning algorithms, which can be used to create feature detectors from unlabeled data to pre-train deep neural networks—neural networks that are composed of many layers. Neural networks are a hot topic not only in academic research, but also in big technology companies such as Facebook, Microsoft, and Google, which invest heavily in artificial neural networks and deep learning research. As of today, complex neural networks powered by deep learning algorithms are considered state of the art when it comes to complex problem solving such as image and voice recognition. Popular examples of everyday products powered by deep learning are Google's image search and Google Translate, a smartphone application that can automatically recognize text in images for real-time translation into 20 languages (http://googleresearch.blogspot.com/2015/07/how-google-translate-squeezes-deep.html).
Many more exciting applications of deep neural networks are under active development at major tech companies, for example, Facebook's DeepFace for tagging images (Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition CVPR, 2014 IEEE Conference, pages 1701–1708) and Baidu's DeepSpeech, which is able to handle voice queries in Mandarin (A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. DeepSpeech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014). In addition, the pharmaceutical industry recently started to use deep learning techniques for drug discovery and toxicity prediction, and research has shown that these novel techniques substantially exceed the performance of traditional methods for virtual screening (T. Unterthiner, A. Mayr, G. Klambauer, and S. Hochreiter. Toxicity prediction using deep learning. arXiv preprint arXiv:1503.01445, 2015).
Single-layer neural network recap
This chapter is all about multi-layer neural networks, how they work, and how to train them to solve complex problems. However, before we dig deeper into a particular multi-layer neural network architecture, let's briefly reiterate some of the concepts of single-layer neural networks that we introduced in Chapter 2, Training Machine Learning Algorithms for Classification, namely, the ADAptive LInear NEuron (Adaline) algorithm that is shown in the following figure:
In Chapter 2, Training Machine Learning Algorithms for Classification, we implemented the Adaline algorithm to perform binary classification, and we used a gradient descent optimization algorithm to learn the weight coefficients of the model. In every epoch (pass over the training set), we updated the weight vector $\mathbf{w}$ using the following update rule:

$\mathbf{w} := \mathbf{w} + \Delta\mathbf{w}, \quad \text{where } \Delta\mathbf{w} = -\eta \nabla J(\mathbf{w})$

In other words, we computed the gradient based on the whole training set and updated the weights of the model by taking a step in the opposite direction of the gradient $\nabla J(\mathbf{w})$. In order to find the optimal weights of the model, we optimized an objective function that we defined as the Sum of Squared Errors (SSE) cost function $J(\mathbf{w})$. Furthermore, we multiplied the gradient by a factor, the learning rate $\eta$, which we chose carefully to balance the speed of learning against the risk of overshooting the global minimum of the cost function.

In gradient descent optimization, we updated all weights simultaneously after each epoch, and we defined the partial derivative for each weight $w_j$ in the weight vector $\mathbf{w}$ as follows:

$\frac{\partial}{\partial w_j} J(\mathbf{w}) = -\sum_i \left(y^{(i)} - a^{(i)}\right) x_j^{(i)}$

Here, $y^{(i)}$ is the target class label of a particular sample $x^{(i)}$, and $a^{(i)}$ is the activation of the neuron, which is a linear function in the special case of Adaline. Furthermore, we defined the activation function $\phi(\cdot)$ as follows:

$\phi(z) = z = a$

Here, the net input $z$ is a linear combination of the weights that connect the input layer to the output layer:

$z = \sum_j w_j x_j = \mathbf{w}^T \mathbf{x}$

While we used the activation $\phi(z)$ to compute the gradient update, we implemented a threshold function (Heaviside function) to squash the continuous-valued output into binary class labels for prediction:

$\hat{y} = \begin{cases} 1 & \text{if } \phi(z) \geq 0 \\ -1 & \text{otherwise} \end{cases}$
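The following minimal NumPy sketch condenses this recap into code. It is an illustrative fragment with toy data and made-up variable names, not the Adaline class from Chapter 2: batch gradient descent on the SSE cost with a linear activation and a final threshold step.

import numpy as np

# toy data: 4 samples, 2 features, class labels in {-1, 1}
X = np.array([[0.1, 0.6], [0.5, 0.2], [0.9, 0.8], [0.3, 0.4]])
y = np.array([-1, 1, 1, -1])

w = np.zeros(X.shape[1])   # weight vector
b = 0.0                    # bias unit
eta = 0.01                 # learning rate

for epoch in range(100):
    output = X.dot(w) + b              # linear activation: phi(z) = z
    errors = y - output                # (y - a) for every sample
    w += eta * X.T.dot(errors)         # gradient step for J(w) = 0.5 * sum((y - a)^2)
    b += eta * errors.sum()
    cost = 0.5 * (errors ** 2).sum()   # SSE cost of this epoch

# threshold (Heaviside-like) function for the final class label prediction
y_pred = np.where(X.dot(w) + b >= 0.0, 1, -1)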
Note
Note that although Adaline consists of two layers, one input layer and one output layer, it is called a single-layer network because there is only a single layer of weighted connections between the input and the output layer.
Introducing the multi-layer neural network architecture
In this section, we will see how to connect multiple single neurons to a multi-layer feedforward neural network; this special type of network is also called a multi-layer perceptron (MLP). The following figure illustrates the concept of an MLP consisting of three layers: one input layer, one hidden layer, and one output layer. The units in the hidden layer are fully connected to the input layer, and the output layer is fully connected to the hidden layer. If such a network has more than one hidden layer, we also call it a deep artificial neural network.
Note
We could add an arbitrary number of hidden layers to the MLP to create deeper network architectures. Practically, we can think of the number of layers and units in a neural network as additional hyperparameters that we want to optimize for a given problem task using the cross-validation that we discussed in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning.
However, the error gradients that we will calculate later via backpropagation become increasingly small as more layers are added to a network. This vanishing gradient problem makes model learning more challenging. Therefore, special algorithms have been developed to pretrain such deep neural network structures; this pretraining is part of what is called deep learning.
As shown in the preceding figure, we denote the $i$th activation unit in the $l$th layer as $a_i^{(l)}$, and the activation units $a_0^{(1)}$ and $a_0^{(2)}$ are the bias units, which we set equal to 1. The activation of the units in the input layer is just the input plus the bias unit:

$\mathbf{a}^{(1)} = \left[a_0^{(1)}, a_1^{(1)}, \dots, a_m^{(1)}\right]^T = \left[1, x_1^{(i)}, \dots, x_m^{(i)}\right]^T$

Each unit in layer $l$ is connected to all units in layer $l+1$ via a weight coefficient. For example, the connection between the $k$th unit in layer $l$ and the $j$th unit in layer $l+1$ would be written as $w_{j,k}^{(l)}$. Please note that the superscript $(i)$ in $x^{(i)}$ stands for the $i$th sample, not the $i$th layer. In the following paragraphs, we will often omit the sample superscript $(i)$ for clarity.
While one unit in the output layer would suffice for a binary classification task, we saw a more general form of a neural network in the preceding figure, which allows us to perform multi-class classification via a generalization of the One-vs-All (OvA) technique. To better understand how this works, remember the one-hot representation of categorical variables that we introduced in Chapter 4, Building Good Training Sets – Data Preprocessing. For example, we would encode the three class labels in the familiar Iris dataset (0=Setosa, 1=Versicolor, 2=Virginica) as follows:

$0 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad 1 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad 2 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$
This one-hot vector representation allows us to tackle classification tasks with an arbitrary number of unique class labels present in the training set.
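As a quick illustration in code (a small sketch using plain NumPy that mirrors the _encode_labels helper appearing in the MLP implementation later in this chapter), the Iris class labels could be one-hot encoded like this:

import numpy as np

def one_hot(y, n_classes):
    """Encode integer class labels into a (n_classes, n_samples) one-hot matrix."""
    onehot = np.zeros((n_classes, y.shape[0]))
    for idx, val in enumerate(y):
        onehot[val, idx] = 1.0
    return onehot

y = np.array([0, 1, 2, 2])   # Setosa, Versicolor, Virginica, Virginica
print(one_hot(y, n_classes=3))
# [[ 1.  0.  0.  0.]
#  [ 0.  1.  0.  0.]
#  [ 0.  0.  1.  1.]]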
If you are new to neural network representations, the terminology around the indices (subscripts and superscripts) may look a little bit confusing at first. You may wonder why we wrote $w_{j,k}^{(l)}$ and not $w_{k,j}^{(l)}$ to refer to the weight coefficient that connects the $k$th unit in layer $l$ to the $j$th unit in layer $l+1$. What may seem a little bit quirky at first will make much more sense in later sections when we vectorize the neural network representation. For example, we will summarize the weights that connect the input and hidden layer by a matrix $\mathbf{W}^{(1)} \in \mathbb{R}^{h \times [m+1]}$, where $h$ is the number of hidden units and $m+1$ is the number of input units plus the bias unit. Since it is important to internalize this notation to follow the concepts later in this chapter, let's summarize what we just discussed in a descriptive illustration of a simplified 3-4-3 multi-layer perceptron:
Activating a neural network via forward propagation
In this section, we will describe the process of forward propagation to calculate the output of an MLP model. To understand how it fits into the context of learning an MLP model, let's summarize the MLP learning procedure in three simple steps:
- Starting at the input layer, we forward propagate the patterns of the training data through the network to generate an output.
- Based on the network's output, we calculate the error that we want to minimize using a cost function that we will describe later.
- We backpropagate the error, find its derivative with respect to each weight in the network, and update the model.
Finally, after repeating the steps for multiple epochs and learning the weights of the MLP, we use forward propagation to calculate the network output and apply a threshold function to obtain the predicted class labels in the one-hot representation, which we described in the previous section.
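As a small illustration of that last step (a sketch with made-up output activations, not code from the original implementation), the continuous network outputs can be converted into class label predictions by taking the index of the largest output unit for each sample:

import numpy as np

# hypothetical output activations: 3 output units (rows) for 2 samples (columns)
a_out = np.array([[0.1, 0.9],
                  [0.8, 0.1],
                  [0.1, 0.2]])

# the unit with the highest activation in each column determines the class label
y_pred = np.argmax(a_out, axis=0)
print(y_pred)   # [1 0]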
Now, let's walk through the individual steps of forward propagation to generate an output from the patterns in the training data. Since each unit in the hidden layer is connected to all units in the input layer, we first calculate the activation $a_1^{(2)}$ as follows:

$z_1^{(2)} = a_0^{(1)} w_{1,0}^{(1)} + a_1^{(1)} w_{1,1}^{(1)} + \dots + a_m^{(1)} w_{1,m}^{(1)}$

$a_1^{(2)} = \phi\left(z_1^{(2)}\right)$

Here, $z_1^{(2)}$ is the net input and $\phi(\cdot)$ is the activation function, which has to be differentiable to learn the weights that connect the neurons using a gradient-based approach. To be able to solve complex problems such as image classification, we need nonlinear activation functions in our MLP model, for example, the sigmoid (logistic) activation function that we used in logistic regression in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn:

$\phi(z) = \frac{1}{1 + e^{-z}}$
As we can remember, the sigmoid function is an S-shaped curve that maps the net input $z$ onto a logistic distribution in the range 0 to 1, crossing the vertical axis at $\phi(z) = 0.5$ for $z = 0$, as shown in the following graph:
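A short snippet along the following lines (an illustrative sketch, not code from the original text) reproduces such a graph:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.arange(-7, 7, 0.1)
plt.plot(z, sigmoid(z))
plt.axvline(0.0, color='k')                 # sigmoid(0) = 0.5
plt.axhline(0.5, ls='dotted', color='k')
plt.yticks([0.0, 0.5, 1.0])
plt.ylim(-0.1, 1.1)
plt.xlabel('z')
plt.ylabel(r'$\phi(z)$')
plt.show()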
The MLP is a typical example of a feedforward artificial neural network. The term feedforward refers to the fact that each layer serves as the input to the next layer without loops, in contrast to recurrent neural networks, an architecture that we will discuss later in this chapter. The term multi-layer perceptron may sound a little bit confusing, since the artificial neurons in this network architecture are typically sigmoid units, not perceptrons. Intuitively, we can think of the neurons in the MLP as logistic regression units that return values in the continuous range between 0 and 1.
For purposes of code efficiency and readability, we will now write the activation in a more compact form using the concepts of basic linear algebra, which will allow us to vectorize our code implementation via NumPy rather than writing multiple nested and expensive Python for loops:

$\mathbf{z}^{(2)} = \mathbf{W}^{(1)} \mathbf{a}^{(1)}, \quad \mathbf{a}^{(2)} = \phi\left(\mathbf{z}^{(2)}\right)$

Here, $\mathbf{a}^{(1)}$ is our $[m+1] \times 1$ dimensional feature vector of a sample plus the bias unit, and $\mathbf{W}^{(1)}$ is an $h \times [m+1]$ dimensional weight matrix, where $h$ is the number of hidden units in our neural network. After the matrix-vector multiplication, we obtain the $h \times 1$ dimensional net input vector $\mathbf{z}^{(2)}$ to calculate the activation $\mathbf{a}^{(2)}$ (where $\mathbf{a}^{(2)} \in \mathbb{R}^{h \times 1}$). Furthermore, we can generalize this computation to all $n$ samples in the training set:

$\mathbf{Z}^{(2)} = \mathbf{W}^{(1)} \left[\mathbf{A}^{(1)}\right]^T$

Here, $\mathbf{A}^{(1)}$ is now an $n \times [m+1]$ matrix, and the matrix-matrix multiplication results in an $h \times n$ dimensional net input matrix $\mathbf{Z}^{(2)}$. Finally, we apply the activation function $\phi(\cdot)$ to each value in the net input matrix to get the activation matrix for the next layer (here, the output layer):

$\mathbf{A}^{(2)} = \phi\left(\mathbf{Z}^{(2)}\right)$

Similarly, we can rewrite the activation of the output layer in vectorized form:

$\mathbf{Z}^{(3)} = \mathbf{W}^{(2)} \mathbf{A}^{(2)}$

Here, we multiply the $t \times [h+1]$ matrix $\mathbf{W}^{(2)}$ ($t$ is the number of output units) by the $[h+1] \times n$ dimensional matrix $\mathbf{A}^{(2)}$ (after a bias row has been added to the hidden-layer activations) to obtain the $t \times n$ dimensional matrix $\mathbf{Z}^{(3)}$ (the columns in this matrix represent the outputs for each sample).

Lastly, we apply the sigmoid activation function to obtain the continuous-valued output of our network:

$\mathbf{A}^{(3)} = \phi\left(\mathbf{Z}^{(3)}\right), \quad \mathbf{A}^{(3)} \in \mathbb{R}^{t \times n}$
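To connect these equations to code, here is a minimal NumPy sketch of forward propagation through one hidden layer (illustrative shapes and random weights only; the full implementation appears later in the NeuralNetMLP class):

import numpy as np
from scipy.special import expit   # numerically stable sigmoid function

rng = np.random.RandomState(1)
n, m, h, t = 5, 4, 3, 2                       # samples, features, hidden units, output units

X = rng.rand(n, m)                            # raw input, shape (n, m)
A1 = np.hstack((np.ones((n, 1)), X))          # add bias column -> (n, m+1)
W1 = rng.uniform(-1.0, 1.0, size=(h, m + 1))  # hidden-layer weights
Z2 = W1.dot(A1.T)                             # net input, shape (h, n)
A2 = expit(Z2)                                # hidden activations, shape (h, n)

A2 = np.vstack((np.ones((1, n)), A2))         # add bias row -> (h+1, n)
W2 = rng.uniform(-1.0, 1.0, size=(t, h + 1))  # output-layer weights
Z3 = W2.dot(A2)                               # net input, shape (t, n)
A3 = expit(Z3)                                # network output, shape (t, n)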
Classifying handwritten digits
In the previous section, we covered a lot of the theory around neural networks, which can be a little bit overwhelming if you are new to this topic. Before we continue with the discussion of the algorithm for learning the weights of the MLP model, backpropagation, let's take a short break from the theory and see a neural network in action.
Note
Neural network theory can be quite complex, thus I want to recommend two additional resources that cover some of the concepts that we discuss in this chapter in more detail:
T. Hastie, J. Friedman, and R. Tibshirani. The Elements of Statistical Learning, Volume 2. Springer, 2009.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
In this section, we will train our first multi-layer neural network to classify handwritten digits from the popular MNIST dataset (short for Modified National Institute of Standards and Technology database), which was constructed by Yann LeCun et al. and serves as a popular benchmark dataset for machine learning algorithms (Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998).
Obtaining the MNIST dataset
The MNIST dataset is publicly available at http://yann.lecun.com/exdb/mnist/ and consists of the following four parts:
- Training set images: train-images-idx3-ubyte.gz (9.9 MB, 47 MB unzipped, 60,000 samples)
- Training set labels: train-labels-idx1-ubyte.gz (29 KB, 60 KB unzipped, 60,000 labels)
- Test set images: t10k-images-idx3-ubyte.gz (1.6 MB, 7.8 MB unzipped, 10,000 samples)
- Test set labels: t10k-labels-idx1-ubyte.gz (5 KB, 10 KB unzipped, 10,000 labels)
The MNIST dataset was constructed from two datasets of the US National Institute of Standards and Technology (NIST). The training set consists of handwritten digits from 250 different people, 50 percent high school students, and 50 percent employees from the Census Bureau. Note that the test set contains handwritten digits from different people following the same split.
After downloading the files, I recommend unzipping them with the Unix/Linux gzip tool from the command line for efficiency, using the following command in your local MNIST download directory:
gzip *ubyte.gz -d
Alternatively, you could use your favorite unzipping tool if you are working with a machine running on Microsoft Windows. The images are stored in byte format, and we will read them into NumPy arrays that we will use to train and test our MLP implementation:
import os
import struct
import numpy as np


def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte' % kind)

    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', lbpath.read(8))
        labels = np.fromfile(lbpath, dtype=np.uint8)

    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16))
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784)

    return images, labels
The load_mnist function returns two arrays, the first being an $n \times m$ dimensional NumPy array (images), where $n$ is the number of samples and $m$ is the number of features. The training dataset consists of 60,000 training digits and the test dataset contains 10,000 samples, respectively. The images in the MNIST dataset consist of $28 \times 28$ pixels, and each pixel is represented by a grayscale intensity value. Here, we unroll the $28 \times 28$ pixels into 1D row vectors, which represent the rows in our images array (784 per row or image). The second array (labels) returned by the load_mnist function contains the corresponding target variable, the class labels (integers 0-9) of the handwritten digits.
The way we read in the image might seem a little bit strange at first:
magic, n = struct.unpack('>II', lbpath.read(8))
labels = np.fromfile(lbpath, dtype=np.uint8)
To understand how these two lines of code work, let's take a look at the dataset description from the MNIST website:
Using the two preceding lines of code, we first read in the magic number, which is a description of the file protocol, as well as the number of items (n) from the file buffer, before we read the following bytes into a NumPy array using the fromfile method. The fmt parameter value '>II' that we passed as an argument to struct.unpack has two parts (illustrated with a small example after the following list):
- >: This is big-endian; it defines the order in which a sequence of bytes is stored. If you are unfamiliar with the terms big-endian and little-endian, you can find an excellent article about endianness on Wikipedia (https://en.wikipedia.org/wiki/Endianness).
- I: This is an unsigned integer.
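For instance (a small illustrative example that uses the documented header of the MNIST training-label file rather than reading the actual file), unpacking eight big-endian bytes yields the magic number and the item count:

import struct

# first 8 bytes of the MNIST training-label file header:
# magic number 2049 (0x00000801) followed by the item count 60000
header = b'\x00\x00\x08\x01\x00\x00\xea\x60'
magic, n = struct.unpack('>II', header)
print(magic, n)   # 2049 60000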
By executing the following code, we will now load the 60,000 training instances as well as the 10,000 test samples from the mnist directory where we unzipped the MNIST dataset:
>>> X_train, y_train = load_mnist('mnist', kind='train')
>>> print('Rows: %d, columns: %d'
...       % (X_train.shape[0], X_train.shape[1]))
Rows: 60000, columns: 784
>>> X_test, y_test = load_mnist('mnist', kind='t10k')
>>> print('Rows: %d, columns: %d'
...       % (X_test.shape[0], X_test.shape[1]))
Rows: 10000, columns: 784
To get an idea of what the images in MNIST look like, let's visualize examples of the digits 0-9 after reshaping the 784-pixel vectors from our feature matrix into the original 28 × 28 image that we can plot via matplotlib's imshow function:
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(nrows=2, ncols=5, sharex=True, sharey=True,)
>>> ax = ax.flatten()
>>> for i in range(10):
...     img = X_train[y_train == i][0].reshape(28, 28)
...     ax[i].imshow(img, cmap='Greys', interpolation='nearest')
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()
We should now see a plot of the subfigures showing a representative image of each unique digit:
In addition, let's also plot multiple examples of the same digit to see how different those handwriting examples really are:
>>> fig, ax = plt.subplots(nrows=5,
...                        ncols=5,
...                        sharex=True,
...                        sharey=True,)
>>> ax = ax.flatten()
>>> for i in range(25):
...     img = X_train[y_train == 7][i].reshape(28, 28)
...     ax[i].imshow(img, cmap='Greys', interpolation='nearest')
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()
After executing the code, we should now see the first 25 variants of the digit 7.
Optionally, we can save the MNIST image data and labels as CSV files to open them in programs that do not support their special byte format. However, we should be aware that the CSV file format will take up substantially more space on your local drive, as listed here:
- train_img.csv: 109.5 MB
- train_labels.csv: 120 KB
- test_img.csv: 18.3 MB
- test_labels.csv: 20 KB
If we decide to save those CSV files, we can execute the following code in our Python session after loading the MNIST data into NumPy arrays:
>>> np.savetxt('train_img.csv', X_train,
...            fmt='%i', delimiter=',')
>>> np.savetxt('train_labels.csv', y_train,
...            fmt='%i', delimiter=',')
>>> np.savetxt('test_img.csv', X_test,
...            fmt='%i', delimiter=',')
>>> np.savetxt('test_labels.csv', y_test,
...            fmt='%i', delimiter=',')
Once we have saved the CSV files, we can load them back into Python using NumPy's genfromtxt function:
>>> X_train = np.genfromtxt('train_img.csv',
...                         dtype=int, delimiter=',')
>>> y_train = np.genfromtxt('train_labels.csv',
...                         dtype=int, delimiter=',')
>>> X_test = np.genfromtxt('test_img.csv',
...                        dtype=int, delimiter=',')
>>> y_test = np.genfromtxt('test_labels.csv',
...                        dtype=int, delimiter=',')
However, it will take substantially longer to load the MNIST data from the CSV files, thus I recommend you stick to the original byte format if possible.
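As a side note, an alternative that is not covered in the original byte-format discussion is NumPy's own compressed binary format, which is usually both smaller on disk and much faster to reload than CSV. A minimal sketch (the archive name mnist_scaled.npz is just a hypothetical example) could look like this:

import numpy as np

# save all four arrays into a single compressed archive
np.savez_compressed('mnist_scaled.npz',
                    X_train=X_train, y_train=y_train,
                    X_test=X_test, y_test=y_test)

# load them back later
mnist = np.load('mnist_scaled.npz')
X_train, y_train = mnist['X_train'], mnist['y_train']
X_test, y_test = mnist['X_test'], mnist['y_test']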
Implementing a multi-layer perceptron
In this subsection, we will now implement the code of an MLP with one input, one hidden, and one output layer to classify the images in the MNIST dataset. I have tried to keep the code as simple as possible. However, it may seem a little bit complicated at first, and I encourage you to download the sample code for this chapter from the Packt Publishing website, where you can find this MLP implementation annotated with comments and syntax highlighting for better readability. If you are not running the code from the accompanying IPython notebook, I recommend you copy it into a Python script file in your current working directory, for example, neuralnet.py, which you can then import into your current Python session via the following command:
from neuralnet import NeuralNetMLP
The code will contain parts that we have not talked about yet, such as the backpropagation algorithm, but most of the code should look familiar to you based on the Adaline implementation in Chapter 2, Training Machine Learning Algorithms for Classification, and the discussion of forward propagation in earlier sections. Do not worry if not all of the code makes immediate sense to you; we will follow up on certain parts later in this chapter. However, going over the code at this stage can make it easier to follow the theory later.
import numpy as np
from scipy.special import expit
import sys


class NeuralNetMLP(object):
    def __init__(self, n_output, n_features, n_hidden=30,
                 l1=0.0, l2=0.0, epochs=500, eta=0.001,
                 alpha=0.0, decrease_const=0.0, shuffle=True,
                 minibatches=1, random_state=None):
        np.random.seed(random_state)
        self.n_output = n_output
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.w1, self.w2 = self._initialize_weights()
        self.l1 = l1
        self.l2 = l2
        self.epochs = epochs
        self.eta = eta
        self.alpha = alpha
        self.decrease_const = decrease_const
        self.shuffle = shuffle
        self.minibatches = minibatches

    def _encode_labels(self, y, k):
        # one-hot encode the class labels into a (k, n_samples) matrix
        onehot = np.zeros((k, y.shape[0]))
        for idx, val in enumerate(y):
            onehot[val, idx] = 1.0
        return onehot

    def _initialize_weights(self):
        # initialize weights (including bias columns) with small random values
        w1 = np.random.uniform(-1.0, 1.0,
                               size=self.n_hidden*(self.n_features + 1))
        w1 = w1.reshape(self.n_hidden, self.n_features + 1)
        w2 = np.random.uniform(-1.0, 1.0,
                               size=self.n_output*(self.n_hidden + 1))
        w2 = w2.reshape(self.n_output, self.n_hidden + 1)
        return w1, w2

    def _sigmoid(self, z):
        # expit is equivalent to 1.0/(1.0 + np.exp(-z))
        return expit(z)

    def _sigmoid_gradient(self, z):
        sg = self._sigmoid(z)
        return sg * (1 - sg)

    def _add_bias_unit(self, X, how='column'):
        if how == 'column':
            X_new = np.ones((X.shape[0], X.shape[1]+1))
            X_new[:, 1:] = X
        elif how == 'row':
            X_new = np.ones((X.shape[0]+1, X.shape[1]))
            X_new[1:, :] = X
        else:
            raise AttributeError('`how` must be `column` or `row`')
        return X_new

    def _feedforward(self, X, w1, w2):
        a1 = self._add_bias_unit(X, how='column')
        z2 = w1.dot(a1.T)
        a2 = self._sigmoid(z2)
        a2 = self._add_bias_unit(a2, how='row')
        z3 = w2.dot(a2)
        a3 = self._sigmoid(z3)
        return a1, z2, a2, z3, a3

    def _L2_reg(self, lambda_, w1, w2):
        return (lambda_/2.0) * (np.sum(w1[:, 1:] ** 2)
                                + np.sum(w2[:, 1:] ** 2))

    def _L1_reg(self, lambda_, w1, w2):
        return (lambda_/2.0) * (np.abs(w1[:, 1:]).sum()
                                + np.abs(w2[:, 1:]).sum())

    def _get_cost(self, y_enc, output, w1, w2):
        term1 = -y_enc * (np.log(output))
        term2 = (1 - y_enc) * np.log(1 - output)
        cost = np.sum(term1 - term2)
        L1_term = self._L1_reg(self.l1, w1, w2)
        L2_term = self._L2_reg(self.l2, w1, w2)
        cost = cost + L1_term + L2_term
        return cost

    def _get_gradient(self, a1, a2, a3, z2, y_enc, w1, w2):
        # backpropagation
        sigma3 = a3 - y_enc
        z2 = self._add_bias_unit(z2, how='row')
        sigma2 = w2.T.dot(sigma3) * self._sigmoid_gradient(z2)
        sigma2 = sigma2[1:, :]
        grad1 = sigma2.dot(a1)
        grad2 = sigma3.dot(a2.T)

        # regularize
        grad1[:, 1:] += (w1[:, 1:] * (self.l1 + self.l2))
        grad2[:, 1:] += (w2[:, 1:] * (self.l1 + self.l2))
        return grad1, grad2

    def predict(self, X):
        a1, z2, a2, z3, a3 = self._feedforward(X, self.w1, self.w2)
        y_pred = np.argmax(z3, axis=0)
        return y_pred

    def fit(self, X, y, print_progress=False):
        self.cost_ = []
        X_data, y_data = X.copy(), y.copy()
        y_enc = self._encode_labels(y, self.n_output)

        delta_w1_prev = np.zeros(self.w1.shape)
        delta_w2_prev = np.zeros(self.w2.shape)

        for i in range(self.epochs):
            # adaptive learning rate
            self.eta /= (1 + self.decrease_const*i)

            if print_progress:
                sys.stderr.write('\rEpoch: %d/%d' % (i+1, self.epochs))
                sys.stderr.flush()

            if self.shuffle:
                # shuffle the samples and their one-hot labels together
                idx = np.random.permutation(y_data.shape[0])
                X_data, y_enc = X_data[idx], y_enc[:, idx]

            mini = np.array_split(range(y_data.shape[0]), self.minibatches)
            for idx in mini:
                # feedforward
                a1, z2, a2, z3, a3 = self._feedforward(X_data[idx],
                                                       self.w1,
                                                       self.w2)
                cost = self._get_cost(y_enc=y_enc[:, idx],
                                      output=a3,
                                      w1=self.w1,
                                      w2=self.w2)
                self.cost_.append(cost)

                # compute gradient via backpropagation
                grad1, grad2 = self._get_gradient(a1=a1, a2=a2,
                                                  a3=a3, z2=z2,
                                                  y_enc=y_enc[:, idx],
                                                  w1=self.w1,
                                                  w2=self.w2)

                # update weights with momentum
                delta_w1, delta_w2 = self.eta * grad1, self.eta * grad2
                self.w1 -= (delta_w1 + (self.alpha * delta_w1_prev))
                self.w2 -= (delta_w2 + (self.alpha * delta_w2_prev))
                delta_w1_prev, delta_w2_prev = delta_w1, delta_w2

        return self
Now, let's initialize a new 784-50-10 MLP, a neural network with 784 input units (n_features), 50 hidden units (n_hidden), and 10 output units (n_output):
>>> nn = NeuralNetMLP(n_output=10,
...                   n_features=X_train.shape[1],
...                   n_hidden=50,
...                   l2=0.1,
...                   l1=0.0,
...                   epochs=1000,
...                   eta=0.001,
...                   alpha=0.001,
...                   decrease_const=0.00001,
...                   shuffle=True,
...                   minibatches=50,
...                   random_state=1)
As you may have noticed by going over our preceding MLP implementation, we also implemented some additional features, which are summarized here (a small sketch of the momentum and adaptive learning rate updates follows the list):
- l2: The $\lambda$ parameter for L2 regularization to decrease the degree of overfitting; equivalently, l1 is the $\lambda$ parameter for L1 regularization.
- epochs: The number of passes over the training set.
- eta: The learning rate $\eta$.
- alpha: A parameter for momentum learning to add a factor of the previous gradient to the weight update for faster learning, $\Delta \mathbf{w}_t = \eta \nabla J(\mathbf{w}_t) + \alpha \Delta \mathbf{w}_{t-1}$ (where $t$ is the current time step or epoch).
- decrease_const: The decrease constant $d$ for an adaptive learning rate that decreases over time for better convergence, $\eta / (1 + t \times d)$.
- shuffle: Shuffling the training set prior to every epoch to prevent the algorithm from getting stuck in cycles.
- minibatches: Splitting of the training data into k mini-batches in each epoch. The gradient is computed for each mini-batch separately instead of over the entire training data for faster learning.
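To make the alpha and decrease_const options more tangible, here is a small sketch (with a made-up gradient, mirroring the update logic of the fit method above) of how the adaptive learning rate and the momentum term enter the weight update:

import numpy as np

eta = 0.001               # initial learning rate
decrease_const = 0.00001  # d in eta / (1 + t * d)
alpha = 0.001             # momentum factor
w = np.zeros(3)           # toy weight vector
delta_w_prev = np.zeros_like(w)

for t in range(5):
    eta /= (1 + decrease_const * t)        # learning rate shrinks over time
    grad = np.array([0.5, -0.2, 0.1])      # made-up gradient for illustration
    delta_w = eta * grad
    w -= delta_w + alpha * delta_w_prev    # current step plus momentum term
    delta_w_prev = delta_w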
Next, we train the MLP using 60,000 samples from the already shuffled MNIST training dataset. Before you execute the following code, please note that training the neural network may take 10-30 minutes on standard desktop computer hardware:
>>> nn.fit(X_train, y_train, print_progress=True) Epoch: 1000/1000
Similar to our previous Adaline implementation, we collect the cost of each mini-batch in a cost_ list that we can now visualize to make sure that the optimization algorithm reached convergence. Since we trained with 50 mini-batches per epoch over 1000 epochs, the list holds 50 × 1000 values, so the x axis of the following plot counts mini-batch steps (Epochs * 50) rather than epochs. The code is as follows:
>>> plt.plot(range(len(nn.cost_)), nn.cost_)
>>> plt.ylim([0, 2000])
>>> plt.ylabel('Cost')
>>> plt.xlabel('Epochs * 50')
>>> plt.tight_layout()
>>> plt.show()
As we see in the following plot, the graph of the cost function looks very noisy. This is due to the fact that we trained our neural network with mini-batch learning, a variant of stochastic gradient descent.
Although we can already see in the plot that the optimization algorithm converged after approximately 800 epochs (40,000/50 = 800), let's plot a smoother version of the cost function against the number of epochs by averaging over the mini-batch intervals. The code is as follows:
>>> batches = np.array_split(range(len(nn.cost_)), 1000)
>>> cost_ary = np.array(nn.cost_)
>>> cost_avgs = [np.mean(cost_ary[i]) for i in batches]
>>> plt.plot(range(len(cost_avgs)),
...          cost_avgs,
...          color='red')
>>> plt.ylim([0, 2000])
>>> plt.ylabel('Cost')
>>> plt.xlabel('Epochs')
>>> plt.tight_layout()
>>> plt.show()
The following plot gives us a clearer picture indicating that the training algorithm converged shortly after the 800th epoch:
Now, let's evaluate the performance of the model by calculating the prediction accuracy:
>>> y_train_pred = nn.predict(X_train)
>>> acc = np.sum(y_train == y_train_pred, axis=0) / X_train.shape[0]
>>> print('Training accuracy: %.2f%%' % (acc * 100))
Training accuracy: 97.74%
As we can see, the model classifies most of the training digits correctly, but how does it generalize to data that it has not seen before? Let's calculate the accuracy on 10,000 images in the test dataset:
>>> y_test_pred = nn.predict(X_test)
>>> acc = np.sum(y_test == y_test_pred, axis=0) / X_test.shape[0]
>>> print('Test accuracy: %.2f%%' % (acc * 100))
Test accuracy: 96.18%
Based on the small discrepancy between the training and test accuracy, we can conclude that the model only slightly overfits the training data. To further fine-tune the model, we could change the number of hidden units, the values of the regularization parameters, the learning rate, the decrease constant, or the momentum factor for the adaptive learning, using the techniques that we discussed in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning (this is left as an exercise for the reader).
Now, let's take a look at some of the images that our MLP struggles with:
>>> miscl_img = X_test[y_test != y_test_pred][:25]
>>> correct_lab = y_test[y_test != y_test_pred][:25]
>>> miscl_lab = y_test_pred[y_test != y_test_pred][:25]
>>> fig, ax = plt.subplots(nrows=5,
...                        ncols=5,
...                        sharex=True,
...                        sharey=True,)
>>> ax = ax.flatten()
>>> for i in range(25):
...     img = miscl_img[i].reshape(28, 28)
...     ax[i].imshow(img,
...                  cmap='Greys',
...                  interpolation='nearest')
...     ax[i].set_title('%d) t: %d p: %d'
...                     % (i+1, correct_lab[i], miscl_lab[i]))
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()
We should now see a subplot matrix where the first number in the subtitles indicates the plot index, the second number indicates the true class label (t), and the third number stands for the predicted class label (p).
As we can see in the preceding figure, some of those images are even challenging for us humans to classify correctly. For example, we can see that the digit 9 is classified as a 3 or 8 if the lower part of the digit has a hook-like curvature (subplots 3, 16, and 17).
Obtaining the MNIST dataset
The MNIST dataset is publicly available at http://yann.lecun.com/exdb/mnist/ and consists of the following four parts:
- Training set images:
train-images-idx3-ubyte.gz
(9.9 MB, 47 MB unzipped, and 60,000 samples) - Training set labels:
train-labels-idx1-ubyte.gz
(29 KB, 60 KB unzipped, and 60,000 labels) - Test set images:
t10k-images-idx3-ubyte.gz
(1.6 MB, 7.8 MB, unzipped and 10,000 samples) - Test set labels:
t10k-labels-idx1-ubyte.gz
(5 KB, 10 KB unzipped, and 10,000 labels)
The MNIST dataset was constructed from two datasets of the US National Institute of Standards and Technology (NIST). The training set consists of handwritten digits from 250 different people, 50 percent high school students, and 50 percent employees from the Census Bureau. Note that the test set contains handwritten digits from different people following the same split.
After downloading the files, I recommend unzipping the files using the Unix/Linux gzip tool from the command line terminal for efficiency using the following command in your local MNIST download directory:
gzip *ubyte.gz -d
Alternatively, you could use your favorite unzipping tool if you are working with a machine running on Microsoft Windows. The images are stored in byte format, and we will read them into NumPy arrays that we will use to train and test our MLP implementation:
import os import struct import numpy as np def load_mnist(path, kind='train'): """Load MNIST data from `path`""" labels_path = os.path.join(path, '%s-labels-idx1-ubyte' % kind) images_path = os.path.join(path, '%s-images-idx3-ubyte' % kind) with open(labels_path, 'rb') as lbpath: magic, n = struct.unpack('>II', lbpath.read(8)) labels = np.fromfile(lbpath, dtype=np.uint8) with open(images_path, 'rb') as imgpath: magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16)) images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784) return images, labels
The load_mnist
function returns two arrays, the first being an dimensional NumPy array (images
), where is the number of samples and is the number of features. The training dataset consists of 60,000 training digits and the test set contains 10,000 samples, respectively. The images in the MNIST dataset consist of pixels, and each pixel is represented by a gray scale intensity value. Here, we unroll the pixels into 1D row vectors, which represent the rows in our image array (784 per row or image). The second array (labels
) returned by the load_mnist
function contains the corresponding target variable, the class labels (integers 0-9) of the handwritten digits.
The way we read in the image might seem a little bit strange at first:
magic, n = struct.unpack('>II', lbpath.read(8)) labels = np.fromfile(lbpath, dtype=np.int8)
To understand how these two lines of code work, let's take a look at the dataset description from the MNIST website:
Using the two lines of the preceding code, we first read in the magic number, which is a description of the file protocol as well as the number of items (n) from the file buffer before we read the following bytes into a NumPy array using the fromfile
method. The fmt
parameter value >II
that we passed as an argument to struct.unpack
has two parts:
>
: This is the big-endian (defines the order in which a sequence of bytes is stored); if you are unfamiliar with the terms big-endian and small-endian, you can find an excellent article about Endianness on Wikipedia (https://en.wikipedia.org/wiki/Endianness).I
: This is an unsigned integer.
By executing the following code, we will now load the 60,000 training instances as well as the 10,000 test samples from the mnist
directory where we unzipped the MNIST dataset:
>>> X_train, y_train = load_mnist('mnist', kind='train') >>> print('Rows: %d, columns: %d' ... % (X_train.shape[0], X_train.shape[1])) Rows: 60000, columns: 784 >>> X_test, y_test = load_mnist('mnist', kind='t10k') >>> print('Rows: %d, columns: %d' ... % (X_test.shape[0], X_test.shape[1])) Rows: 10000, columns: 784
To get a idea what the images in MNIST look like, let's visualize examples of the digits 0-9 after reshaping the 784-pixel vectors from our feature matrix into the original 28 × 28 image that we can plot via matplotlib's imshow
function:
>>> import matplotlib.pyplot as plt >>> fig, ax = plt.subplots(nrows=2, ncols=5, sharex=True, sharey=True,) >>> ax = ax.flatten() >>> for i in range(10): ... img = X_train[y_train == i][0].reshape(28, 28) ... ax[i].imshow(img, cmap='Greys', interpolation='nearest') >>> ax[0].set_xticks([]) >>> ax[0].set_yticks([]) >>> plt.tight_layout() >>> plt.show()
We should now see a plot of the subfigures showing a representative image of each unique digit:
In addition, let's also plot multiple examples of the same digit to see how different those handwriting examples really are:
>>> fig, ax = plt.subplots(nrows=5, ... ncols=5, ... sharex=True, ... sharey=True,) >>> ax = ax.flatten() >>> for i in range(25): ... img = X_train[y_train == 7][i].reshape(28, 28) ... ax[i].imshow(img, cmap='Greys', interpolation='nearest') >>> ax[0].set_xticks([]) >>> ax[0].set_yticks([]) >>> plt.tight_layout() >>> plt.show()
After executing the code, we should now see the first 25 variants of the digit 7.
Optionally, we can save the MNIST image data and labels as CSV files to open them in programs that do not support their special byte format. However, we should be aware that the CSV file format will take up substantially more space on your local drive, as listed here:
train_img.csv
: 109.5 MBtrain_labels.csv
: 120 KBtest_img.csv
: 18.3 MBtest_labels
: 20 KB
If we decide to save those CSV files, we can execute the following code in our Python session after loading the MNIST data into NumPy arrays:
>>> np.savetxt('train_img.csv', X_train, ... fmt='%i', delimiter=',') >>> np.savetxt('train_labels.csv', y_train, ... fmt='%i', delimiter=',') >>> np.savetxt('test_img.csv', X_test, ... fmt='%i', delimiter=',') >>> np.savetxt('test_labels.csv', y_test, ... fmt='%i', delimiter=',')
Once we have saved the CSV files, we can load them back into Python using NumPy's genfromtxt
function:
>>> X_train = np.genfromtxt('train_img.csv', ... dtype=int, delimiter=',') >>> y_train = np.genfromtxt('train_labels.csv', ... dtype=int, delimiter=',') >>> X_test = np.genfromtxt('test_img.csv', ... dtype=int, delimiter=',') >>> y_test = np.genfromtxt('test_labels.csv', ... dtype=int, delimiter=',')
However, it will take substantially longer to load the MNIST data from the CSV files, thus I recommend you stick to the original byte format if possible.
Implementing a multi-layer perceptron
In this subsection, we will now implement the code of an MLP with one input, one hidden, and one output layer to classify the images in the MNIST dataset. I have tried to keep the code as simple as possible. However, it may seem a little bit complicated at first, and I encourage you to download the sample code for this chapter from the Packt Publishing website, where you can find this MLP implementation annotated with comments and syntax highlighting for better readability. If you are not running the code from the accompanying IPython notebook, I recommend you copy it into a Python script file in your current working directory, for example, neuralnet.py
, which you can then import into your current Python session via the following command:
from neuralnet import NeuralNetMLP
The code will contain parts that we have not talked about yet, such as the backpropagation algorithm, but most of the code should look familiar to you based on the Adaline implementation in Chapter 2, Training Machine Learning Algorithms for Classification, and the discussion of forward propagation in earlier sections. Do not worry if not all of the code makes immediate sense to you; we will follow up on certain parts later in this chapter. However, going over the code at this stage can make it easier to follow the theory later.
import numpy as np from scipy.special import expit import sys class NeuralNetMLP(object): def __init__(self, n_output, n_features, n_hidden=30, l1=0.0, l2=0.0, epochs=500, eta=0.001, alpha=0.0, decrease_const=0.0, shuffle=True, minibatches=1, random_state=None): np.random.seed(random_state) self.n_output = n_output self.n_features = n_features self.n_hidden = n_hidden self.w1, self.w2 = self._initialize_weights() self.l1 = l1 self.l2 = l2 self.epochs = epochs self.eta = eta self.alpha = alpha self.decrease_const = decrease_const self.shuffle = shuffle self.minibatches = minibatches def _encode_labels(self, y, k): onehot = np.zeros((k, y.shape[0])) for idx, val in enumerate(y): onehot[val, idx] = 1.0 return onehot def _initialize_weights(self): w1 = np.random.uniform(-1.0, 1.0, size=self.n_hidden*(self.n_features + 1)) w1 = w1.reshape(self.n_hidden, self.n_features + 1) w2 = np.random.uniform(-1.0, 1.0, size=self.n_output*(self.n_hidden + 1)) w2 = w2.reshape(self.n_output, self.n_hidden + 1) return w1, w2 def _sigmoid(self, z): # expit is equivalent to 1.0/(1.0 + np.exp(-z)) return expit(z) def _sigmoid_gradient(self, z): sg = self._sigmoid(z) return sg * (1 - sg) def _add_bias_unit(self, X, how='column'): if how == 'column': X_new = np.ones((X.shape[0], X.shape[1]+1)) X_new[:, 1:] = X elif how == 'row': X_new = np.ones((X.shape[0]+1, X.shape[1])) X_new[1:, :] = X else: raise AttributeError('`how` must be `column` or `row`') return X_new def _feedforward(self, X, w1, w2): a1 = self._add_bias_unit(X, how='column') z2 = w1.dot(a1.T) a2 = self._sigmoid(z2) a2 = self._add_bias_unit(a2, how='row') z3 = w2.dot(a2) a3 = self._sigmoid(z3) return a1, z2, a2, z3, a3 def _L2_reg(self, lambda_, w1, w2): return (lambda_/2.0) * (np.sum(w1[:, 1:] ** 2)\ + np.sum(w2[:, 1:] ** 2)) def _L1_reg(self, lambda_, w1, w2): return (lambda_/2.0) * (np.abs(w1[:, 1:]).sum()\ + np.abs(w2[:, 1:]).sum()) def _get_cost(self, y_enc, output, w1, w2): term1 = -y_enc * (np.log(output)) term2 = (1 - y_enc) * np.log(1 - output) cost = np.sum(term1 - term2) L1_term = self._L1_reg(self.l1, w1, w2) L2_term = self._L2_reg(self.l2, w1, w2) cost = cost + L1_term + L2_term return cost def _get_gradient(self, a1, a2, a3, z2, y_enc, w1, w2): # backpropagation sigma3 = a3 - y_enc z2 = self._add_bias_unit(z2, how='row') sigma2 = w2.T.dot(sigma3) * self._sigmoid_gradient(z2) sigma2 = sigma2[1:, :] grad1 = sigma2.dot(a1) grad2 = sigma3.dot(a2.T) # regularize grad1[:, 1:] += (w1[:, 1:] * (self.l1 + self.l2)) grad2[:, 1:] += (w2[:, 1:] * (self.l1 + self.l2)) return grad1, grad2 def predict(self, X): a1, z2, a2, z3, a3 = self._feedforward(X, self.w1, self.w2) y_pred = np.argmax(z3, axis=0) return y_pred def fit(self, X, y, print_progress=False): self.cost_ = [] X_data, y_data = X.copy(), y.copy() y_enc = self._encode_labels(y, self.n_output) delta_w1_prev = np.zeros(self.w1.shape) delta_w2_prev = np.zeros(self.w2.shape) for i in range(self.epochs): # adaptive learning rate self.eta /= (1 + self.decrease_const*i) if print_progress: sys.stderr.write( '\rEpoch: %d/%d' % (i+1, self.epochs)) sys.stderr.flush() if self.shuffle: idx = np.random.permutation(y_data.shape[0]) X_data, y_data = X_data[idx], y_data[idx] mini = np.array_split(range( y_data.shape[0]), self.minibatches) for idx in mini: # feedforward a1, z2, a2, z3, a3 = self._feedforward( X[idx], self.w1, self.w2) cost = self._get_cost(y_enc=y_enc[:, idx], output=a3, w1=self.w1, w2=self.w2) self.cost_.append(cost) # compute gradient via backpropagation grad1, grad2 = 
self._get_gradient(a1=a1, a2=a2, a3=a3, z2=z2, y_enc=y_enc[:, idx], w1=self.w1, w2=self.w2) # update weights delta_w1, delta_w2 = self.eta * grad1,\ self.eta * grad2 self.w1 -= (delta_w1 + (self.alpha * delta_w1_prev)) self.w2 -= (delta_w2 + (self.alpha * delta_w2_prev)) delta_w1_prev, delta_w2_prev = delta_w1, delta_w2 return self
Now, let's initialize a new 784-50-10 MLP, a neural network with 784 input units (n_features
), 50 hidden units (n_hidden
), and 10 output units (n_output
):
>>> nn = NeuralNetMLP(n_output=10, ... n_features=X_train.shape[1], ... n_hidden=50, ... l2=0.1, ... l1=0.0, ... epochs=1000, ... eta=0.001, ... alpha=0.001, ... decrease_const=0.00001, ... shuffle=True, ... minibatches=50, ... random_state=1)
As you may have noticed, by going over our preceding MLP implementation, we also implemented some additional features, which are summarized here:
l2
: The parameter for L2 regularization to decrease the degree of overfitting; equivalently,l1
is the parameter for L1 regularization.epochs
: The number of passes over the training set.eta
: The learning rate .alpha
: A parameter for momentum learning to add a factor of the previous gradient to the weight update for faster learning (where is the current time step or epoch).decrease_const
: The decrease constant for an adaptive learning rate that decreases over time for better convergence .shuffle
: Shuffling the training set prior to every epoch to prevent the algorithm from getting stuck in cycles.Minibatches
: Splitting of the training data into k mini-batches in each epoch. The gradient is computed for each mini-batch separately instead of the entire training data for faster learning.
Next, we train the MLP using 60,000 samples from the already shuffled MNIST training dataset. Before you execute the following code, please note that training the neural network may take 10-30 minutes on standard desktop computer hardware:
>>> nn.fit(X_train, y_train, print_progress=True) Epoch: 1000/1000
Similar to our previous Adaline implementation, we save the cost for each epoch in a cost_
list that we can now visualize, making sure that the optimization algorithm reached convergence. Here, we only plot every 50th step to account for the 50 mini-batches (50 mini-batches × 1000 epochs). The code is as follows:
>>> plt.plot(range(len(nn.cost_)), nn.cost_) >>> plt.ylim([0, 2000]) >>> plt.ylabel('Cost') >>> plt.xlabel('Epochs * 50') >>> plt.tight_layout() >>> plt.show()
As we see in the following plot, the graph of the cost function looks very noisy. This is due to the fact that we trained our neural network with mini-batch learning, a variant of stochastic gradient descent.
Although we can already see in the plot that the optimization algorithm converged after approximately 800 epochs (40,000/50 = 800), let's plot a smoother version of the cost function against the number of epochs by averaging over the mini-batch intervals. The code is as follows:
>>> batches = np.array_split(range(len(nn.cost_)), 1000) >>> cost_ary = np.array(nn.cost_) >>> cost_avgs = [np.mean(cost_ary[i]) for i in batches] >>> plt.plot(range(len(cost_avgs)), ... cost_avgs, ... color='red') >>> plt.ylim([0, 2000]) >>> plt.ylabel('Cost') >>> plt.xlabel('Epochs') >>> plt.tight_layout() >>> plt.show()
The following plot gives us a clearer picture indicating that the training algorithm converged shortly after the 800th epoch:
Now, let's evaluate the performance of the model by calculating the prediction accuracy:
>>> y_train_pred = nn.predict(X_train) >>> acc = np.sum(y_train == y_train_pred, axis=0) / X_train.shape[0] >>> print('Training accuracy: %.2f%%' % (acc * 100)) Training accuracy: 97.74%
As we can see, the model classifies most of the training digits correctly, but how does it generalize to data that it has not seen before? Let's calculate the accuracy on 10,000 images in the test dataset:
>>> y_test_pred = nn.predict(X_test) >>> acc = np.sum(y_test == y_test_pred, axis=0) / X_test.shape[0] >>> print('Training accuracy: %.2f%%' % (acc * 100)) Test accuracy: 96.18%
Based on the small discrepancy between training and test accuracy, we can conclude that the model only slightly overfits the training data. To further fine-tune the model, we could change the number of hidden units, values of the regularization parameters, learning rate, values of the decrease constant, or the adaptive learning using the techniques that we discussed in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning (this is left as an exercise for the reader).
Now, let's take a look at some of the images that our MLP struggles with:
>>> miscl_img = X_test[y_test != y_test_pred][:25] >>> correct_lab = y_test[y_test != y_test_pred][:25] >>> miscl_lab= y_test_pred[y_test != y_test_pred][:25] >>> fig, ax = plt.subplots(nrows=5, ... ncols=5, ... sharex=True, ... sharey=True,) >>> ax = ax.flatten() >>> for i in range(25): ... img = miscl_img[i].reshape(28, 28) ... ax[i].imshow(img, ... cmap='Greys', ... interpolation='nearest') ... ax[i].set_title('%d) t: %d p: %d' ... % (i+1, correct_lab[i], miscl_lab[i])) >>> ax[0].set_xticks([]) >>> ax[0].set_yticks([]) >>> plt.tight_layout() >>> plt.show()
We should now see a subplot matrix where the first number in the subtitles indicates the plot index, the second number indicates the true class label (t), and the third number stands for the predicted class label (p).
As we can see in the preceding figure, some of those images are even challenging for us humans to classify correctly. For example, we can see that the digit 9 is classified as a 3 or 8 if the lower part of the digit has a hook-like curvature (subplots 3, 16, and 17).
Implementing a multi-layer perceptron
In this subsection, we will now implement the code of an MLP with one input, one hidden, and one output layer to classify the images in the MNIST dataset. I have tried to keep the code as simple as possible. However, it may seem a little bit complicated at first, and I encourage you to download the sample code for this chapter from the Packt Publishing website, where you can find this MLP implementation annotated with comments and syntax highlighting for better readability. If you are not running the code from the accompanying IPython notebook, I recommend you copy it into a Python script file in your current working directory, for example, neuralnet.py
, which you can then import into your current Python session via the following command:
from neuralnet import NeuralNetMLP
The code will contain parts that we have not talked about yet, such as the backpropagation algorithm, but most of the code should look familiar to you based on the Adaline implementation in Chapter 2, Training Machine Learning Algorithms for Classification, and the discussion of forward propagation in earlier sections. Do not worry if not all of the code makes immediate sense to you; we will follow up on certain parts later in this chapter. However, going over the code at this stage can make it easier to follow the theory later.
import numpy as np from scipy.special import expit import sys class NeuralNetMLP(object): def __init__(self, n_output, n_features, n_hidden=30, l1=0.0, l2=0.0, epochs=500, eta=0.001, alpha=0.0, decrease_const=0.0, shuffle=True, minibatches=1, random_state=None): np.random.seed(random_state) self.n_output = n_output self.n_features = n_features self.n_hidden = n_hidden self.w1, self.w2 = self._initialize_weights() self.l1 = l1 self.l2 = l2 self.epochs = epochs self.eta = eta self.alpha = alpha self.decrease_const = decrease_const self.shuffle = shuffle self.minibatches = minibatches def _encode_labels(self, y, k): onehot = np.zeros((k, y.shape[0])) for idx, val in enumerate(y): onehot[val, idx] = 1.0 return onehot def _initialize_weights(self): w1 = np.random.uniform(-1.0, 1.0, size=self.n_hidden*(self.n_features + 1)) w1 = w1.reshape(self.n_hidden, self.n_features + 1) w2 = np.random.uniform(-1.0, 1.0, size=self.n_output*(self.n_hidden + 1)) w2 = w2.reshape(self.n_output, self.n_hidden + 1) return w1, w2 def _sigmoid(self, z): # expit is equivalent to 1.0/(1.0 + np.exp(-z)) return expit(z) def _sigmoid_gradient(self, z): sg = self._sigmoid(z) return sg * (1 - sg) def _add_bias_unit(self, X, how='column'): if how == 'column': X_new = np.ones((X.shape[0], X.shape[1]+1)) X_new[:, 1:] = X elif how == 'row': X_new = np.ones((X.shape[0]+1, X.shape[1])) X_new[1:, :] = X else: raise AttributeError('`how` must be `column` or `row`') return X_new def _feedforward(self, X, w1, w2): a1 = self._add_bias_unit(X, how='column') z2 = w1.dot(a1.T) a2 = self._sigmoid(z2) a2 = self._add_bias_unit(a2, how='row') z3 = w2.dot(a2) a3 = self._sigmoid(z3) return a1, z2, a2, z3, a3 def _L2_reg(self, lambda_, w1, w2): return (lambda_/2.0) * (np.sum(w1[:, 1:] ** 2)\ + np.sum(w2[:, 1:] ** 2)) def _L1_reg(self, lambda_, w1, w2): return (lambda_/2.0) * (np.abs(w1[:, 1:]).sum()\ + np.abs(w2[:, 1:]).sum()) def _get_cost(self, y_enc, output, w1, w2): term1 = -y_enc * (np.log(output)) term2 = (1 - y_enc) * np.log(1 - output) cost = np.sum(term1 - term2) L1_term = self._L1_reg(self.l1, w1, w2) L2_term = self._L2_reg(self.l2, w1, w2) cost = cost + L1_term + L2_term return cost def _get_gradient(self, a1, a2, a3, z2, y_enc, w1, w2): # backpropagation sigma3 = a3 - y_enc z2 = self._add_bias_unit(z2, how='row') sigma2 = w2.T.dot(sigma3) * self._sigmoid_gradient(z2) sigma2 = sigma2[1:, :] grad1 = sigma2.dot(a1) grad2 = sigma3.dot(a2.T) # regularize grad1[:, 1:] += (w1[:, 1:] * (self.l1 + self.l2)) grad2[:, 1:] += (w2[:, 1:] * (self.l1 + self.l2)) return grad1, grad2 def predict(self, X): a1, z2, a2, z3, a3 = self._feedforward(X, self.w1, self.w2) y_pred = np.argmax(z3, axis=0) return y_pred def fit(self, X, y, print_progress=False): self.cost_ = [] X_data, y_data = X.copy(), y.copy() y_enc = self._encode_labels(y, self.n_output) delta_w1_prev = np.zeros(self.w1.shape) delta_w2_prev = np.zeros(self.w2.shape) for i in range(self.epochs): # adaptive learning rate self.eta /= (1 + self.decrease_const*i) if print_progress: sys.stderr.write( '\rEpoch: %d/%d' % (i+1, self.epochs)) sys.stderr.flush() if self.shuffle: idx = np.random.permutation(y_data.shape[0]) X_data, y_data = X_data[idx], y_data[idx] mini = np.array_split(range( y_data.shape[0]), self.minibatches) for idx in mini: # feedforward a1, z2, a2, z3, a3 = self._feedforward( X[idx], self.w1, self.w2) cost = self._get_cost(y_enc=y_enc[:, idx], output=a3, w1=self.w1, w2=self.w2) self.cost_.append(cost) # compute gradient via backpropagation grad1, grad2 = 
self._get_gradient(a1=a1, a2=a2, a3=a3, z2=z2, y_enc=y_enc[:, idx], w1=self.w1, w2=self.w2) # update weights delta_w1, delta_w2 = self.eta * grad1,\ self.eta * grad2 self.w1 -= (delta_w1 + (self.alpha * delta_w1_prev)) self.w2 -= (delta_w2 + (self.alpha * delta_w2_prev)) delta_w1_prev, delta_w2_prev = delta_w1, delta_w2 return self
Now, let's initialize a new 784-50-10 MLP, a neural network with 784 input units (n_features
), 50 hidden units (n_hidden
), and 10 output units (n_output
):
>>> nn = NeuralNetMLP(n_output=10, ... n_features=X_train.shape[1], ... n_hidden=50, ... l2=0.1, ... l1=0.0, ... epochs=1000, ... eta=0.001, ... alpha=0.001, ... decrease_const=0.00001, ... shuffle=True, ... minibatches=50, ... random_state=1)
As you may have noticed, by going over our preceding MLP implementation, we also implemented some additional features, which are summarized here:
l2
: The parameter for L2 regularization to decrease the degree of overfitting; equivalently,l1
is the parameter for L1 regularization.epochs
: The number of passes over the training set.eta
: The learning rate .alpha
: A parameter for momentum learning to add a factor of the previous gradient to the weight update for faster learning (where is the current time step or epoch).decrease_const
: The decrease constant for an adaptive learning rate that decreases over time for better convergence .shuffle
: Shuffling the training set prior to every epoch to prevent the algorithm from getting stuck in cycles.Minibatches
: Splitting of the training data into k mini-batches in each epoch. The gradient is computed for each mini-batch separately instead of the entire training data for faster learning.
Next, we train the MLP using 60,000 samples from the already shuffled MNIST training dataset. Before you execute the following code, please note that training the neural network may take 10-30 minutes on standard desktop computer hardware:
>>> nn.fit(X_train, y_train, print_progress=True) Epoch: 1000/1000
Similar to our previous Adaline implementation, we save the cost for each epoch in a cost_
list that we can now visualize, making sure that the optimization algorithm reached convergence. Here, we only plot every 50th step to account for the 50 mini-batches (50 mini-batches × 1000 epochs). The code is as follows:
>>> plt.plot(range(len(nn.cost_)), nn.cost_) >>> plt.ylim([0, 2000]) >>> plt.ylabel('Cost') >>> plt.xlabel('Epochs * 50') >>> plt.tight_layout() >>> plt.show()
As we see in the following plot, the graph of the cost function looks very noisy. This is due to the fact that we trained our neural network with mini-batch learning, a variant of stochastic gradient descent.
Although we can already see in the plot that the optimization algorithm converged after approximately 800 epochs (40,000/50 = 800), let's plot a smoother version of the cost function against the number of epochs by averaging over the mini-batch intervals. The code is as follows:
>>> batches = np.array_split(range(len(nn.cost_)), 1000)
>>> cost_ary = np.array(nn.cost_)
>>> cost_avgs = [np.mean(cost_ary[i]) for i in batches]
>>> plt.plot(range(len(cost_avgs)),
...          cost_avgs,
...          color='red')
>>> plt.ylim([0, 2000])
>>> plt.ylabel('Cost')
>>> plt.xlabel('Epochs')
>>> plt.tight_layout()
>>> plt.show()
The following plot gives us a clearer picture indicating that the training algorithm converged shortly after the 800th epoch:
Now, let's evaluate the performance of the model by calculating the prediction accuracy:
>>> y_train_pred = nn.predict(X_train)
>>> acc = np.sum(y_train == y_train_pred, axis=0) / X_train.shape[0]
>>> print('Training accuracy: %.2f%%' % (acc * 100))
Training accuracy: 97.74%
As we can see, the model classifies most of the training digits correctly, but how does it generalize to data that it has not seen before? Let's calculate the accuracy on 10,000 images in the test dataset:
>>> y_test_pred = nn.predict(X_test)
>>> acc = np.sum(y_test == y_test_pred, axis=0) / X_test.shape[0]
>>> print('Test accuracy: %.2f%%' % (acc * 100))
Test accuracy: 96.18%
Based on the small discrepancy between the training and test accuracy, we can conclude that the model only slightly overfits the training data. To further fine-tune the model, we could change the number of hidden units, the values of the regularization parameters, the learning rate, or the decrease constant of the adaptive learning rate, using the techniques that we discussed in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning (this is left as an exercise for the reader).
Now, let's take a look at some of the images that our MLP struggles with:
>>> miscl_img = X_test[y_test != y_test_pred][:25]
>>> correct_lab = y_test[y_test != y_test_pred][:25]
>>> miscl_lab = y_test_pred[y_test != y_test_pred][:25]
>>> fig, ax = plt.subplots(nrows=5,
...                        ncols=5,
...                        sharex=True,
...                        sharey=True)
>>> ax = ax.flatten()
>>> for i in range(25):
...     img = miscl_img[i].reshape(28, 28)
...     ax[i].imshow(img,
...                  cmap='Greys',
...                  interpolation='nearest')
...     ax[i].set_title('%d) t: %d p: %d'
...                     % (i+1, correct_lab[i], miscl_lab[i]))
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()
We should now see a 5×5 subplot matrix where the first number in the subplot titles indicates the plot index, the second number indicates the true class label (t), and the third number stands for the predicted class label (p).
As we can see in the preceding figure, some of those images are even challenging for us humans to classify correctly. For example, we can see that the digit 9 is classified as a 3 or 8 if the lower part of the digit has a hook-like curvature (subplots 3, 16, and 17).
Training an artificial neural network
Now that we have seen a neural network in action and have gained a basic understanding of how it works by looking over the code, let's dig a little bit deeper into some of the concepts, such as the logistic cost function and the backpropagation algorithm that we implemented to learn the weights.
Computing the logistic cost function
The logistic cost function that we implemented as the _get_cost method is actually pretty simple to follow, since it is the same cost function that we described in the logistic regression section in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn.
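As a reminder, and hedging slightly on the exact notation, this cost for n training samples and a single output unit can be written as follows (it is precisely what the term1 and term2 expressions in _get_cost compute):

\[
J(\mathbf{w}) = -\sum_{i=1}^{n} \Big[ y^{(i)} \log\big(a^{(i)}\big) + \big(1 - y^{(i)}\big) \log\big(1 - a^{(i)}\big) \Big]
\]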
Here, a^(i) is the sigmoid activation of the ith unit in one of the layers, which we compute in the forward propagation step:
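In symbols, this is the logistic (sigmoid) function applied to the net input z, as implemented in the _sigmoid method:

\[
a^{(i)} = \phi\big(z^{(i)}\big) = \frac{1}{1 + e^{-z^{(i)}}}
\]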
Now, let's add a regularization term, which allows us to reduce the degree of overfitting. As you will recall from earlier chapters, the L2 and L1 regularization terms are defined as follows (remember that we don't regularize the bias units):
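A sketch of the two penalty terms, both scaled by λ/2 as in our _L2_reg and _L1_reg methods (where m is the number of weights, bias terms excluded):

\[
\text{L2:}\quad \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert_2^2 = \frac{\lambda}{2}\sum_{j=1}^{m} w_j^2
\qquad\qquad
\text{L1:}\quad \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert_1 = \frac{\lambda}{2}\sum_{j=1}^{m} \lvert w_j \rvert
\]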
Although our MLP implementation supports both L1 and L2 regularization, we will now only focus on the L2 regularization term for simplicity. However, the same concepts apply to the L1 regularization term. By adding the L2 regularization term to our logistic cost function, we obtain the following equation:
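Sticking to the notation above, a sketch of the L2-regularized cost is:

\[
J(\mathbf{w}) = -\sum_{i=1}^{n} \Big[ y^{(i)} \log\big(a^{(i)}\big) + \big(1 - y^{(i)}\big) \log\big(1 - a^{(i)}\big) \Big] + \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert_2^2
\]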
Since we implemented an MLP for multi-class classification, the network returns an output vector of t elements that we need to compare to the t-dimensional target vector in the one-hot encoding representation. For example, the activation of the third layer and the target class (here: class 2) for a particular sample may look like this:
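As an illustration (the activation values below are just made up; the one-hot encoding follows _encode_labels, where the class label indexes the row that is set to 1):

\[
a^{(3)} = \begin{bmatrix} 0.1 \\ 0.1 \\ 0.9 \\ \vdots \\ 0.3 \end{bmatrix},
\qquad
y = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}
\]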
Thus, we need to generalize the logistic cost function to all t activation units in our network, so our cost function (without the regularization term) becomes:
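A sketch of this generalized cost, mirroring what np.sum(term1 - term2) computes in _get_cost:

\[
J(\mathbf{W}) = -\sum_{i=1}^{n}\sum_{j=1}^{t} \Big[ y_j^{(i)} \log\big(a_j^{(i)}\big) + \big(1 - y_j^{(i)}\big) \log\big(1 - a_j^{(i)}\big) \Big]
\]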
Here, the superscript i is the index of a particular sample in our training set, and the subscript j runs over the output units.
The generalized regularization term may look a little complicated at first, but here we are just summing the squared weights of every layer l (excluding the bias units, which we added as the first column of each weight matrix) and adding this L2-penalty term to the cost. Remember that our goal is to minimize the cost function J(W); thus, we need to calculate the partial derivative of J(W) with respect to each weight for every layer in the network:
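Hedging slightly on the index conventions (u_l below denotes the number of units in layer l, not counting the bias unit), the regularized cost and the gradient we are after can be sketched as:

\[
J(\mathbf{W}) = -\sum_{i=1}^{n}\sum_{j=1}^{t} \Big[ y_j^{(i)} \log\big(a_j^{(i)}\big) + \big(1 - y_j^{(i)}\big) \log\big(1 - a_j^{(i)}\big) \Big] + \frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{i=1}^{u_l}\sum_{j=1}^{u_{l+1}} \big(w_{j,i}^{(l)}\big)^2
\]

\[
\frac{\partial J(\mathbf{W})}{\partial w_{j,i}^{(l)}}
\]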
In the next section, we will talk about the backpropagation algorithm, which allows us to calculate these partial derivatives to minimize the cost function.
Note that W consists of multiple matrices. In a multi-layer perceptron with one hidden layer, we have the weight matrix W^(1), which connects the input to the hidden layer, and W^(2), which connects the hidden layer to the output layer. An intuitive visualization of the matrix W is provided in the following figure:

In this simplified figure, it may seem that both W^(1) and W^(2) have the same number of rows and columns, which is typically not the case unless we initialize an MLP with the same number of hidden units, output units, and input features.

If this sounds confusing, stay tuned for the next section, where we will discuss the dimensionality of W^(1) and W^(2) in more detail in the context of the backpropagation algorithm.
Training neural networks via backpropagation
In this section, we will go through the math of backpropagation to understand how you can learn the weights in a neural network very efficiently. Depending on how comfortable you are with mathematical representations, the following equations may seem relatively complicated at first. Many people prefer a bottom-up approach and like to go over the equations step by step to develop an intuition for algorithms. However, if you prefer a top-down approach and want to learn about backpropagation without all the mathematical notation, I recommend that you read the next section, Developing your intuition for backpropagation, first and revisit this section later.
In the previous section, we saw how to calculate the cost as the difference between the activation of the last layer and the target class label. Now, we will see how the backpropagation algorithm works to update the weights in our MLP model, which we implemented in the _get_gradient method. As we recall from the beginning of this chapter, we first need to apply forward propagation in order to obtain the activation of the output layer. Concisely, we just forward propagate the input features through the connections in the network, as shown here:
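Written out to match the _feedforward method for a mini-batch of samples (W^(1) and W^(2) correspond to w1 and w2, φ is the sigmoid, and the bias units are appended to A^(1) as a column and to A^(2) as a row by _add_bias_unit):

\[
\mathbf{Z}^{(2)} = \mathbf{W}^{(1)}\big[\mathbf{A}^{(1)}\big]^{T},\quad
\mathbf{A}^{(2)} = \phi\big(\mathbf{Z}^{(2)}\big),\quad
\mathbf{Z}^{(3)} = \mathbf{W}^{(2)}\mathbf{A}^{(2)},\quad
\mathbf{A}^{(3)} = \phi\big(\mathbf{Z}^{(3)}\big)
\]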
In backpropagation, we propagate the error from right to left. We start by calculating the error vector of the output layer:
Here, y is the vector of the true class labels.
Next, we calculate the error term of the hidden layer:
Here, the term φ'(z^(2)) is simply the derivative of the sigmoid activation function, which we implemented as _sigmoid_gradient:
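Collecting the pieces from the preceding paragraphs in one place, a sketch that mirrors the sigma3 and sigma2 computations in _get_gradient and the _sigmoid_gradient method:

\[
\boldsymbol{\delta}^{(3)} = \mathbf{a}^{(3)} - \mathbf{y},\qquad
\boldsymbol{\delta}^{(2)} = \big[\mathbf{W}^{(2)}\big]^{T}\boldsymbol{\delta}^{(3)} \ast \phi'\big(\mathbf{z}^{(2)}\big),\qquad
\phi'(z) = \phi(z)\big(1 - \phi(z)\big)
\]

(In the code, a bias row is temporarily added to z2, and the corresponding first row of sigma2 is discarded again afterwards.)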
Note that the asterisk symbol means element-wise multiplication in this context.
Note
Although it is not important to follow the next equations, you may be curious as to how I obtained the derivative of the activation function. I summarized the derivation step by step here:
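A step-by-step sketch of that derivative, using φ(z) = 1 / (1 + e^(-z)):

\[
\frac{\partial \phi(z)}{\partial z}
= \frac{\partial}{\partial z}\left(\frac{1}{1 + e^{-z}}\right)
= \frac{e^{-z}}{\left(1 + e^{-z}\right)^2}
= \frac{1}{1 + e^{-z}} - \left(\frac{1}{1 + e^{-z}}\right)^2
= \phi(z) - \big(\phi(z)\big)^2
= \phi(z)\big(1 - \phi(z)\big)
\]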
To better understand how we compute the δ^(2) term, let's walk through it in more detail for a single training sample. W^(2) is a t×(h+1)-dimensional matrix, where t is the number of output class labels and h is the number of hidden units. Multiplying its transpose [W^(2)]^T with the t-dimensional error vector δ^(3) yields an (h+1)-dimensional vector. We then perform a pair-wise multiplication between this vector and the sigmoid gradient of z^(2), which is also (h+1)-dimensional once the bias row is added, and finally discard the bias row, leaving the h-dimensional error vector δ^(2) of the hidden layer. Eventually, after obtaining the δ terms, we can write the derivative of the cost function, as described next.
Next, we need to accumulate the partial derivatives: for every weight, we combine the activation of the jth node in layer l with the error of the ith node in layer l + 1. Remember that we need to compute this accumulation for every sample in the training set, so it is easier to implement it as a vectorized version, as in our preceding MLP code implementation. After we have accumulated the partial derivatives, we can add the regularization term (the bias columns are left out). Lastly, after we have computed the gradients, we can update the weights by taking a step in the opposite direction of the gradient:
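In matrix form, matching grad1, grad2, and the weight update in fit (λ here stands for the l2 setting; the momentum term is omitted for brevity):

\[
\Delta^{(l)} := \Delta^{(l)} + \boldsymbol{\delta}^{(l+1)}\big[\mathbf{a}^{(l)}\big]^{T},\qquad
\Delta^{(l)} := \Delta^{(l)} + \lambda\,\mathbf{W}^{(l)} \ \ \text{(bias columns excluded)},\qquad
\mathbf{W}^{(l)} := \mathbf{W}^{(l)} - \eta\,\Delta^{(l)}
\]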
To bring everything together, let's summarize backpropagation in the following figure:
Developing your intuition for backpropagation
Although backpropagation was rediscovered and popularized almost 30 years ago, it still remains one of the most widely used algorithms to train artificial neural networks very efficiently. In this section, we'll see a more intuitive summary and the bigger picture of how this fascinating algorithm works.
In essence, backpropagation is just a very computationally efficient approach to compute the derivatives of a complex cost function. Our goal is to use those derivatives to learn the weight coefficients for parameterizing a multi-layer artificial neural network. The challenge in the parameterization of neural networks is that we are typically dealing with a very large number of weight coefficients in a high-dimensional feature space. In contrast to other cost functions that we have seen in previous chapters, the error surface of a neural network cost function is not convex or smooth. There are many bumps in this high-dimensional cost surface (local minima) that we have to overcome in order to find the global minimum of the cost function.
You may recall the concept of the chain rule from your introductory calculus classes. The chain rule is an approach to computing the derivative of a complex, nested function, for example f(g(x)), by breaking it down into its basic components:
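For instance, for a function nested one level deep, the chain rule gives:

\[
\frac{d}{dx}\big[f\big(g(x)\big)\big] = f'\big(g(x)\big)\cdot g'(x)
\]

and, more generally, for a deeply nested function such as F(x) = f(g(h(u(v(x))))):

\[
\frac{dF}{dx} = \frac{df}{dg}\cdot\frac{dg}{dh}\cdot\frac{dh}{du}\cdot\frac{du}{dv}\cdot\frac{dv}{dx}
\]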
In the context of computer algebra, a set of techniques known as automatic differentiation has been developed to solve such problems very efficiently. If you are interested in learning more about automatic differentiation in machine learning applications, I recommend the following resource: A. G. Baydin and B. A. Pearlmutter. Automatic Differentiation of Algorithms for Machine Learning. arXiv preprint arXiv:1404.7456, 2014, which is freely available on arXiv at http://arxiv.org/pdf/1404.7456.pdf.
Automatic differentiation comes in two modes, the forward and the reverse mode. Backpropagation is simply a special case of reverse-mode automatic differentiation. The key point is that applying the chain rule in the forward mode can be quite expensive, since we would have to multiply a large matrix (the Jacobian) for each layer before eventually multiplying by a vector to obtain the output. The trick of the reverse mode is that we start from right to left: we multiply a matrix by a vector, which yields another vector that is multiplied by the next matrix, and so on. Matrix-vector multiplication is computationally much cheaper than matrix-matrix multiplication, which is why backpropagation is one of the most popular algorithms used in neural network training.
Debugging neural networks with gradient checking
Implementations of artificial neural networks can be quite complex, and it is always a good idea to manually check that we have implemented backpropagation correctly. In this section, we will talk about a simple procedure called gradient checking, which is essentially a comparison between our analytical gradients in the network and numerical gradients. Gradient checking is not specific to feedforward neural networks but can be applied to any other neural network architecture that uses gradient-based optimization. Even if you are planning to implement simpler algorithms using gradient-based optimization, such as linear regression, logistic regression, and support vector machines, it is generally not a bad idea to check whether the gradients are computed correctly.
In the previous sections, we defined a cost function J(W), where W is the matrix of the weight coefficients of an artificial network. Note that W is, roughly speaking, a "stacked" matrix consisting of the matrices W^(1) and W^(2) in a multi-layer perceptron with one hidden layer. We defined W^(1) as the h×(m+1)-dimensional matrix that connects the input layer to the hidden layer, where h is the number of hidden units and m is the number of features (input units); the extra column holds the bias weights. The matrix W^(2) that connects the hidden layer to the output layer has the dimensions t×(h+1), where t is the number of output units. We then calculated the derivative of the cost function with respect to a weight w_(j,i)^(l) as follows:
Remember that we are updating the weights by taking a step in the opposite direction of the gradient. In gradient checking, we compare this analytical solution to a numerically approximated gradient:
Here, ε (epsilon) is typically a very small number, for example 1e-5 (note that 1e-5 is just a more convenient notation for 0.00001). Intuitively, we can think of this finite difference approximation as the slope of the secant line connecting the points of the cost function evaluated at the two weights w and w + ε (both scalar values), as shown in the following figure. We are omitting the superscripts and subscripts for simplicity.
An even better approach that yields a more accurate approximation of the gradient is to compute the symmetric (or centered) difference quotient given by the two-point formula:
Typically, the approximated difference between the numerical gradient and analytical gradient is then calculated as the L2 vector norm. For practical reasons, we unroll the computed gradient matrices into flat vectors so that we can calculate the error (the difference between the gradient vectors) more conveniently:
One problem is that the error is not scale invariant (small errors are more significant if the weight vector norms are small too). Thus, it is recommended to calculate a normalized difference:
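Putting the last few paragraphs into formulas, a sketch that matches the relative error computed in the _gradient_checking method that we implement below (ε corresponds to epsilon and the norms are computed with np.linalg.norm):

\[
\frac{\partial J}{\partial w_{j,i}^{(l)}} \approx \frac{J\big(w_{j,i}^{(l)} + \varepsilon\big) - J\big(w_{j,i}^{(l)}\big)}{\varepsilon}
\qquad\text{(one-sided)}
\]

\[
\frac{\partial J}{\partial w_{j,i}^{(l)}} \approx \frac{J\big(w_{j,i}^{(l)} + \varepsilon\big) - J\big(w_{j,i}^{(l)} - \varepsilon\big)}{2\varepsilon}
\qquad\text{(centered)}
\]

\[
\text{relative error} = \frac{\big\lVert \nabla_{\text{num}} J - \nabla J \big\rVert_2}{\big\lVert \nabla_{\text{num}} J \big\rVert_2 + \big\lVert \nabla J \big\rVert_2}
\]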
Now, we want the relative error between the numerical gradient and the analytical gradient to be as small as possible. Before we implement gradient checking, we need to discuss one more detail: what is the acceptable error threshold to pass the gradient check? The relative error threshold for passing the gradient check depends on the complexity of the network architecture. As a rule of thumb, the more hidden layers we add, the larger the difference between the numerical and analytical gradient can become if backpropagation is implemented correctly. Since we have implemented a relatively simple neural network architecture in this chapter, we want to be rather strict about the threshold and define the following rules:
- Relative error <= 1e-7 means everything is okay!
- Relative error <= 1e-4 means the condition is problematic, and we should look into it.
- Relative error > 1e-4 means there is probably something wrong in our code.
Now that we have established these ground rules, let's implement gradient checking. To do so, we can simply take the NeuralNetMLP class that we implemented previously and add the following method to the class body:
def _gradient_checking(self, X, y_enc, w1, w2, epsilon, grad1, grad2):
    """ Apply gradient checking (for debugging only)

    Returns
    ---------
    relative_error : float
        Relative error between the numerically approximated
        gradients and the backpropagated gradients.

    """
    # numerically approximate the gradient of every weight in w1
    num_grad1 = np.zeros(np.shape(w1))
    epsilon_ary1 = np.zeros(np.shape(w1))
    for i in range(w1.shape[0]):
        for j in range(w1.shape[1]):
            epsilon_ary1[i, j] = epsilon
            a1, z2, a2, z3, a3 = self._feedforward(
                X, w1 - epsilon_ary1, w2)
            cost1 = self._get_cost(y_enc, a3, w1 - epsilon_ary1, w2)
            a1, z2, a2, z3, a3 = self._feedforward(
                X, w1 + epsilon_ary1, w2)
            cost2 = self._get_cost(y_enc, a3, w1 + epsilon_ary1, w2)
            num_grad1[i, j] = (cost2 - cost1) / (2 * epsilon)
            epsilon_ary1[i, j] = 0

    # numerically approximate the gradient of every weight in w2
    num_grad2 = np.zeros(np.shape(w2))
    epsilon_ary2 = np.zeros(np.shape(w2))
    for i in range(w2.shape[0]):
        for j in range(w2.shape[1]):
            epsilon_ary2[i, j] = epsilon
            a1, z2, a2, z3, a3 = self._feedforward(
                X, w1, w2 - epsilon_ary2)
            cost1 = self._get_cost(y_enc, a3, w1, w2 - epsilon_ary2)
            a1, z2, a2, z3, a3 = self._feedforward(
                X, w1, w2 + epsilon_ary2)
            cost2 = self._get_cost(y_enc, a3, w1, w2 + epsilon_ary2)
            num_grad2[i, j] = (cost2 - cost1) / (2 * epsilon)
            epsilon_ary2[i, j] = 0

    # compare the flattened numerical and analytical gradients
    num_grad = np.hstack((num_grad1.flatten(),
                          num_grad2.flatten()))
    grad = np.hstack((grad1.flatten(), grad2.flatten()))
    norm1 = np.linalg.norm(num_grad - grad)
    norm2 = np.linalg.norm(num_grad)
    norm3 = np.linalg.norm(grad)
    relative_error = norm1 / (norm2 + norm3)
    return relative_error
The _gradient_checking code seems rather simple. However, my personal recommendation is to keep it as simple as possible. Our goal is to double-check the gradient computation, so we want to make sure that we do not introduce any additional mistakes in gradient checking by writing efficient but complex code. Next, we only need to make a small modification to the fit method. In the following code, I omitted the code at the beginning of the fit function for clarity, and the only lines that we need to add to the method are implemented between the comments ## start gradient checking and ## end gradient checking:
class MLPGradientCheck(object):
    [...]

    def fit(self, X, y, print_progress=False):
        [...]

                # compute gradient via backpropagation
                grad1, grad2 = self._get_gradient(
                                      a1=a1, a2=a2,
                                      a3=a3, z2=z2,
                                      y_enc=y_enc[:, idx],
                                      w1=self.w1,
                                      w2=self.w2)

                ## start gradient checking
                grad_diff = self._gradient_checking(
                                 X=X_data[idx],
                                 y_enc=y_enc[:, idx],
                                 w1=self.w1,
                                 w2=self.w2,
                                 epsilon=1e-5,
                                 grad1=grad1,
                                 grad2=grad2)

                if grad_diff <= 1e-7:
                    print('Ok: %s' % grad_diff)
                elif grad_diff <= 1e-4:
                    print('Warning: %s' % grad_diff)
                else:
                    print('PROBLEM: %s' % grad_diff)
                ## end gradient checking

                # update weights; alpha * delta_w_prev adds momentum
                delta_w1 = self.eta * grad1
                delta_w2 = self.eta * grad2
                self.w1 -= (delta_w1 +
                            (self.alpha * delta_w1_prev))
                self.w2 -= (delta_w2 +
                            (self.alpha * delta_w2_prev))
                delta_w1_prev = delta_w1
                delta_w2_prev = delta_w2

        return self
Assuming that we named our modified multi-layer perceptron class MLPGradientCheck, we can now initialize a new MLP with 10 hidden units. Also, we disable regularization, adaptive learning, and momentum learning. In addition, we use regular gradient descent by setting minibatches to 1. The code is as follows:
>>> nn_check = MLPGradientCheck(n_output=10,
...                             n_features=X_train.shape[1],
...                             n_hidden=10,
...                             l2=0.0,
...                             l1=0.0,
...                             epochs=10,
...                             eta=0.001,
...                             alpha=0.0,
...                             decrease_const=0.0,
...                             minibatches=1,
...                             random_state=1)
One downside of gradient checking is that it is computationally very, very expensive. Training a neural network with gradient checking enabled is so slow that we really only want to use it for debugging purposes. For this reason, it is not uncommon to run gradient checking only on a handful of training samples (here, we choose 5). The code is as follows:
>>> nn_check.fit(X_train[:5], y_train[:5], print_progress=False)
Ok: 2.56712936241e-10
Ok: 2.94603251069e-10
Ok: 2.37615620231e-10
Ok: 2.43469423226e-10
Ok: 3.37872073158e-10
Ok: 3.63466384861e-10
Ok: 2.22472120785e-10
Ok: 2.33163708438e-10
Ok: 3.44653686551e-10
Ok: 2.17161707211e-10
As we can see from the code output, our multi-layer perceptron passes this test with excellent results.
Convergence in neural networks
You might be wondering why we did not use regular gradient descent but mini-batch learning to train our neural network for the handwritten digit classification. You may recall our discussion on stochastic gradient descent, which we used to implement online learning. In online learning, we compute the gradient based on a single training example at a time to perform the weight update. Although this is a stochastic approach, it often leads to very accurate solutions with a much faster convergence than regular gradient descent. Mini-batch learning is a special form of stochastic gradient descent where we compute the gradient based on a subset k of the n training samples, with 1 < k < n. Mini-batch learning has the advantage over online learning that we can make use of our vectorized implementations to improve computational efficiency, while still updating the weights much faster than in regular gradient descent. Intuitively, you can think of mini-batch learning as predicting the vote turnout of a presidential election from a poll by asking only a representative subset of the population rather than asking the entire population.
In addition, we added more tuning parameters, such as a decrease constant for an adaptive learning rate and a momentum parameter. The reason is that neural networks are much harder to train than simpler algorithms such as Adaline, logistic regression, or support vector machines. In multi-layer neural networks, we typically have hundreds, thousands, or even billions of weights that we need to optimize. Unfortunately, the cost function has a rough surface, and the optimization algorithm can easily become trapped in local minima, as shown in the following figure:
Note that this representation is extremely simplified, since our neural network has many weight dimensions, which makes it impossible to visualize the actual cost surface; here, we only show the cost surface for a single weight on the x axis. However, the main message is that we do not want our algorithm to get trapped in local minima. By increasing the learning rate, we can more readily escape such local minima. On the other hand, we also increase the chance of overshooting the global optimum if the learning rate is too large. Since we initialize the weights randomly, we start with a solution to the optimization problem that is typically hopelessly wrong. The decrease constant that we defined earlier lets us climb down the cost surface quickly at the beginning, and the decreasing, adaptive learning rate then allows us to better anneal to the global minimum.
Other neural network architectures
In this chapter, we discussed one of the most popular feedforward neural network representations, the multi-layer perceptron. Neural networks are currently one of the most active research topics in the machine learning field, and there are many other neural network architectures that are well beyond the scope of this book. If you are interested in learning more about neural networks and algorithms for deep learning, I recommend the following introduction and overview: Y. Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. Yoshua Bengio's book is currently freely available at http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf.
Although neural networks really are a topic for another book, let's take at least a brief look at two other popular architectures, convolutional neural networks and recurrent neural networks.
Convolutional Neural Networks
Convolutional Neural Networks (CNNs or ConvNets) gained popularity in computer vision due to their extraordinarily good performance on image classification tasks. As of today, CNNs are one of the most popular neural network architectures in deep learning. The key idea behind convolutional neural networks is to build many layers of feature detectors that take the spatial arrangement of the pixels in an input image into account. Note that there exist many different variants of CNNs. In this section, we will discuss only the general idea behind this architecture. If you are interested in learning more about CNNs, I recommend taking a look at the publications of Yann LeCun (http://yann.lecun.com), who is one of the co-inventors of CNNs. In particular, I can recommend the following literature for getting started with CNNs:
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- P. Y. Simard, D. Steinkraus, and J. C. Platt. Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis. IEEE, 2003, p.958.
As you will recall from our multi-layer perceptron implementation, we unrolled the images into feature vectors, and these inputs were fully connected to the hidden layer; spatial information was not encoded in this network architecture. In CNNs, we use receptive fields to connect the input layer to a feature map. These receptive fields can be understood as overlapping windows that we slide over the pixels of an input image to create a feature map. The stride length of the sliding window as well as the window size are additional hyperparameters of the model that we need to define a priori. The process of creating the feature map is also called convolution. An example of such a convolutional layer, the layer that connects the input pixels to each unit in the feature map, is shown in the following figure:
It is important to note that the feature detectors are replicated, which means that the receptive fields that map the features to the units in the next layer share the same weights. Here, the key idea is that if a feature detector is useful in one part of the image, it might be useful in another part as well. The nice side effect of this approach is that it greatly reduces the number of parameters that need to be learned. Since we allow different patches of the image to be represented in different ways, CNNs are particularly good at recognizing objects of different sizes and at different positions in an image. We do not need to worry so much about rescaling and centering the images, as was done for MNIST.
In CNNs, a convolutional layer is followed by a pooling layer (sometimes also called sub-sampling). In pooling, we summarize neighboring feature detectors to reduce the number of features for the next layer. Pooling can be understood as a simple method of feature extraction where we take the average or maximum value of a patch of neighboring features and pass it on to the next layer. To create a deep convolutional neural network, we stack multiple layers—alternating between convolutional and pooling layers—before we connect it to a multi-layer perceptron for classification. This is shown in the following figure:
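To make the convolution and pooling operations a little more concrete, here is a minimal NumPy sketch (the function names and toy data are purely illustrative and not part of any library or of the architectures discussed above): a single shared-weight filter is slid over an image with stride 1, and the resulting feature map is reduced by non-overlapping 2×2 max-pooling.

import numpy as np

def convolve2d_valid(image, kernel):
    """Naive 'valid' convolution: slide one shared-weight kernel
    over the image with stride 1 and sum the element-wise products."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max-pooling over size x size patches."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size   # crop to a multiple of the pool size
    patches = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return patches.max(axis=(1, 3))

# toy example: a 6x6 "image" and a 3x3 filter with random weights
rng = np.random.RandomState(1)
img = rng.rand(6, 6)
kernel = rng.rand(3, 3)
fmap = convolve2d_valid(img, kernel)   # 4x4 feature map
pooled = max_pool(fmap, size=2)        # 2x2 pooled output
print(fmap.shape, pooled.shape)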
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) can be thought of as feedforward neural networks with feedback loops, trained via backpropagation through time. In RNNs, the neurons only fire for a limited amount of time before they are (temporarily) deactivated. In turn, these neurons activate other neurons that fire at a later point in time. Basically, we can think of recurrent neural networks as MLPs with an additional time variable. The time component and dynamic structure allow the network to use not only the current inputs but also the inputs that it encountered earlier.
Although RNNs have achieved remarkable results in speech recognition, language translation, and connected handwriting recognition, these network architectures are typically much harder to train. This is because we cannot simply backpropagate the error layer by layer; we have to consider the additional time component, which amplifies the vanishing and exploding gradient problem. In 1997, Juergen Schmidhuber and his co-workers introduced the so-called Long Short-Term Memory (LSTM) units to overcome this problem (S. Hochreiter and J. Schmidhuber. Long Short-term Memory. Neural Computation, 9(8):1735–1780, 1997).
However, we should note that there are many different variants of RNNs, and a detailed discussion is beyond the scope of this book.
A few last words about neural network implementation
You might be wondering why we went through all of this theory just to implement a simple multi-layer artificial neural network that can classify handwritten digits instead of using an open source Python machine learning library. One reason is that, at the time of writing, scikit-learn does not have an MLP implementation. More importantly, we (machine learning practitioners) should have at least a basic understanding of the algorithms that we are using in order to apply machine learning techniques appropriately and successfully.
Now that we know how feedforward neural networks work, we are ready to explore more sophisticated Python libraries built on top of NumPy, such as Theano (http://deeplearning.net/software/theano/), which allows us to construct neural networks more efficiently. We will see this in Chapter 13, Parallelizing Neural Network Training with Theano. Over the last couple of years, Theano has gained a lot of popularity among machine learning researchers, who use it to construct deep neural networks because of its ability to optimize mathematical expressions for computations on multi-dimensional arrays utilizing Graphics Processing Units (GPUs).
A great collection of Theano tutorials can be found at http://deeplearning.net/software/theano/tutorial/index.html#tutorial.
There are also a number of interesting libraries that are being actively developed to train neural networks in Theano, which you should keep on your radar:
- Pylearn2 (http://deeplearning.net/software/pylearn2/)
- Lasagne (https://lasagne.readthedocs.org/en/latest/)
- Keras (http://keras.io)
Summary
In this chapter, you have learned about the most important concepts behind multi-layer artificial neural networks, which are currently the hottest topic in machine learning research. In Chapter 2, Training Machine Learning Algorithms for Classification, we started our journey with simple single-layer neural network structures and now we have connected multiple neurons to a powerful neural network architecture to solve complex problems such as handwritten digit recognition. We demystified the popular backpropagation algorithm, which is one of the building blocks of many neural network models that are used in deep learning. After learning about the backpropagation algorithm, we were able to update the weights of such a complex neural network. We also added useful modifications such as mini-batch learning and an adaptive learning rate that allows us to train a neural network more efficiently.