Handwritten digit recognition using CNNs

For sure, we begin with exploration of the handwritten digit dataset.

Get started with exploring MNIST

The MNIST dataset from http://yann.lecun.com/exdb/mnist/ consists of a training set of 60,000 samples, and a testing set of 10,000 samples. As said previously, images were originally taken from the NIST, and then centered and resized to the same height and width (28 * 28).

Rather than handling the ubyte files, train-images-idx3-ubyte.gz and train-labels-idx1-ubyte.gz in the preceding website and merge them, we use a dataset that is well-formatted from the Kaggle competition Digit Recognizer, https://www.kaggle.com/c/digit-recognizer/. We can download the training dataset, train.csv directly from https://www.kaggle.com/c/digit-recognizer/data. It is the only labeled dataset provided in the site, and we will use it to train classification models, evaluate models and do predictions. Now let's load it up:

> data <- read.csv ("train.csv")
> dim(data)
[1] 42000 785

We have 42,000 labeled samples available, and each sample has 784 features, which means each digit image has 784 (28 * 28) pixels. Take a look at the label and the first 5 features (pixels) for each of the first 6 data samples:

> head(data[1:6])
label pixel0 pixel1 pixel2 pixel3 pixel4
1 1 0 0 0 0 0
2 0 0 0 0 0 0
3 1 0 0 0 0 0
4 4 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0

The target label ranging from 0 to 9 denotes 10 digits:

> unique(unlist(data[1]))
[1] 1 0 4 7 3 5 8 9 2 6

The pixel variable ranges from 0 to 255, representing the brightness of the pixel, for example 0 means black and 255 stands for white:

> min(data[2:785])
[1] 0
> max(data[2:785])
[1] 255

Now let's take a look at two samples, first, the fourth image:

> sample_4 <- matrix(as.numeric(data[4,-1]), nrow = 28, byrow = TRUE)
> image(sample_4, col = grey.colors(255))

Where we reshaped the feature vector of length 784 into a matrix of 28 * 28.

Second, the 7th image:

> sample_7 <- matrix(as.numeric(data[7,-1]), nrow = 28, byrow = TRUE)
> image(sample_7, col = grey.colors(255))

The result is as follows:

We noticed that the images are rotated 90 degrees to the left. To better view the images, a rotation of 90 degrees clockwise is required. We simply need to reserve elements in each column of an image matrix:

> # Rotate the matrix by reversing elements in each column
> rotate <- function(x) t(apply(x, 2, rev))

Now visualize the rotated images:


> image(rotate(sample_4), col = grey.colors(255))

> image(rotate(sample_7), col = grey.colors(255))

After viewing what the data and images behind look like, we do more exploratory analysis on the labels and features. Firstly, because it is a classification problem, we inspect whether the classes from the data are balanced or unbalanced as a good practice. But before doing so, we should transform the label from integer to factor:

> # Transform target variable "label" from integer to factor, in order to perform classification
> is.factor(data$label)
[1] FALSE
> data$label <- as.factor(data$label)
> is.factor(data$label)
[1] TRUE

Now, we can summarize the label distribution in counts:

> summary(data$label)
0 1 2 3 4 5 6 7 8 9
4132 4684 4177 4351 4072 3795 4137 4401 4063 4188

Or combined with proportion (%):

> proportion <- prop.table(table(data$label)) * 100
> cbind(count=table(data$label), proportion=proportion)
count proportion
0 4132 9.838095
1 4684 11.152381
2 4177 9.945238
3 4351 10.359524
4 4072 9.695238
5 3795 9.035714
6 4137 9.850000
7 4401 10.478571
8 4063 9.673810
9 4188 9.971429

Classes are balanced.

Now, we explore the distribution of features, the pixels. As an example, we take the 4 pixels from the central 2*2 block (that is, pixel376, pixel377, pixel404, and pixel405) in each image and display the histogram for each of the 9 digits:

> central_block <- c("pixel376", "pixel377", "pixel404", "pixel405")
> par(mfrow=c(2, 2))
> for(i in 1:9) {
+ hist(c(as.matrix(data[data$label==i, central_block])),
+ main=sprintf("Histogram for digit %d", i),
+ xlab="Pixel value")
+ }

The resulting pixel brightness histograms for digit 1 to 4 are displayed respectively, as follows:

Histograms for digits 5 to 9:

And that for digit 9:

The brightness of central pixels is distributed differently among 9 digits. For instance, the majority of the central pixels are bright, as digit 8 is usually written in a way that strokes go through the center; while digit 7 is not written in this way, hence most of the central pixels are dark. Pixels taken from other positions can also be distinctly distributed among different digits.

The exploratory analysis we just conducted helps move us forward with building classification models based on pixels.

First attempt – logistic regression

We start off with probably the most basic classifier, the logistic regression, to be specific multinomial logistic regression as it is a multiclass case. It is a probabilistic linear classifier parameterized by a weight matrix W (also called coefficient matrix) and a bias (also called intercept) vector b. And it maps an input vector x to a set of probabilities P(y=1), P(y=2),. . ., P(y-K) for K possible classes.

A multinomial logistic regression for two possible classes can be represented graphically as follows:

Suppose x is n-dimension, then the weight matrix W is of size n by K with each column W_k representing the coefficients associated with class k; similarly, the bias vector b is of length K, with each element b_k served as the bias for class k. For simplicity, the bias b can be viewed as an additional row in the weight matrix W. So the probability of x being class k can be expressed mathematically as:

Where softmax() denotes the softmax function and that is why multinomial logistic regression is often called softmax regression.

Given a set of training samples where , the optimal model w is obtained by minimizing the cost (also called log loss), which is defined as:

where

As usual we resort to gradient descent, an iterative optimization algorithm, to solve for the optimal w. In each iteration, w moves a step that is proportional to the negative derivative Δw of the objective function at the current point. That is, w:=w – ƞΔw, where ƞ is the learning rate. Each column Δw_k of Δw can be computed as:

The well trained model, the optimal w will be used to classify a new sample x' by:

Armed with the mechanics of the multinomial logistic regression we just reviewed, we can then apply it as the first solution to our digit classification project.

We first split the dataset into two subsets for training and testing respectively using the caret package.

caret stands for classification and regression training. The package is designed to facilitate the process for training and evaluating models. It contains tools and methods for data splitting, data pre-processing, feature selection and model tuning. Documentation and a full list of functions can be found in https://cran.r-project.org/web/packages/caret/caret.pdf.

Install and import package caret:

> if (!require("caret"))
+ install.packages("caret")
> library (caret)
Loading required package: lattice
Loading required package: ggplot2

We first split the data into two partitions, 75% for training and 25% for testing, using the createDataPartition function:

> set.seed(42)
> train_perc = 0.75
> train_index <- createDataPartition(data$label, p=train_perc, list=FALSE)
> data_train <- data[train_index,]
> data_test <- data[-train_index,]

To ensure the experiments are reproducible, it is always a good practice to pick a seed from the random number generator.

Then, we implement the multinomial logistic regression model using the nnet package. The package contains functions for feed-forward single-layer neural networks as well as multinomial logistic regression models. More details can be found in https://cran.r-project.org/web/packages/nnet/nnet.pdf:

> library(nnet)
> # Multinomial logistic regression
> model_lr <- multinom(label ~ ., data=data_train, MaxNWts=10000,
decay=5e-3, maxit=100)
# weights: 7860 (7065 variable)
initial value 72538.338185
iter 10 value 17046.804658
iter 20 value 11166.225504
iter 30 value 9514.340319
iter 40 value 8819.724147
iter 50 value 8405.001712
iter 60 value 8164.997939
iter 70 value 7983.427139
iter 80 value 7897.005940
iter 90 value 7831.663204
iter 100 value 7730.047242
final value 7730.047242
stopped after 100 iterations

We fit a multinomial logistic regression model on the training subset, with parameters which include:

MaxNWts=10000: It allows, at most, 10,000 weights. In our case, there are (784 dimensions + 1 bias) * 10 classes = 7850 elements in the weight matrix w
decay=5e-3: The regularization strength, the weight decay is 0.005
maxit=100: The maximum number of iterations is set to be 100

The error value is printed for every 10 iterations, and it is decreasing. The model converges as the maximum number of iterations is reached. Then we use the trained model to predict the classes of the testing samples:

> prediction_lr <- predict(model_lr, data_test, type = "class")

Take a look at the prediction results of the first five samples:

> prediction_lr[1:5]
[1] 1 0 7 5 8
Levels: 0 1 2 3 4 5 6 7 8 9

And their true values are:

> data_test$label[1:5]
[1] 1 0 7 5 8
Levels: 0 1 2 3 4 5 6 7 8 9

We can also obtain the confusion matrix by:

> cm_lr = table(data_test$label, prediction_lr)
> cm_lr

And the classification accuracy:

> accuracy_lr = mean(prediction_lr == data_test$label)
> accuracy_lr
[1] 0.8935886

89.4% for the first try. Not bad! We could definitely do better by tweaking the model parameters, such as decay and maxit. But our focus is for a more advanced model that learns the underneath patterns better. So we move on with the second solution, the feed-forward neural networks with a single hidden layer.

Going from logistic regression to single-layer neural networks

Basically, logistic regression is a feed-forward neural network without a hidden layer, where the output layer directly connects with the input layer. In other words, logistic regression is a single neuron that maps the input to the output layer. Theoretically, the neural networks with an additional hidden layer between the input and output layer should be able to learn more about the relationship underneath.

A single-layer neural network for two possible classes can be represented graphically as follows:

Suppose x is n-dimension, and there are H hidden units in the hidden layer, then the weight matrix w⁽¹⁾connecting the input layer to the hidden layer is of size n by H with each column w_h⁽¹⁾ representing the coefficients associated with the h-th hidden unit. So, the output (also called activation) of the h-th hidden unit a_h⁽²⁾ can be expressed mathematically as:

For example, for the outputs of the first, second and the last hidden unit:

where f(z) is an activation function. Typical choices for the activation function in simple networks include logistic function (more often called sigmoid function) and tanh function (which can be considered a re-scaled version of logistic function):

Plots of these two functions are as follows:

We will be using the logistic function in our single-layered networks for now.

For a case of K possible classes, the weight matrix w⁽²⁾ connecting the hidden layer to the output layer is of size H by K. Each column w_k⁽²⁾ represents the coefficients associated with class k. The input to the output layer is the output of the hidden layer a⁽²⁾ = {a₁⁽²⁾, a₂⁽²⁾, ,...,a_I⁽²⁾ }, the probability of x being class k can be expressed mathematically as (for consistency, we denote it as a_k⁽³⁾):

Similarly, given m training samples, to train the neural network, we learn all weights w = {w⁽¹⁾, w⁽²⁾}using gradient descent with the goal of minimizing the mean squared error cost J(w).

Computation of the gradients Δw can be realized through the backpropagation algorithm. The idea of the backpropagation algorithm is the following, we first travel through the network and compute all outputs of the hidden layers and output layer; then moving backward from the last layer, we calculate how much each node contributed to the error in the final output and propagate it back to the previous layers. In our single-layer network, the detailed steps are:

Compute a⁽²⁾ for the hidden layer and feed them to the output layer to compute the outputs a⁽³⁾
For the output layer, compute the derivative of the cost function of one sample j(W) with regards to each unit, , or , rewritten for the entire layer
For the hidden layer, we compute the error term δ⁽²⁾ based on a weighted average of
Compute the gradients applying the chain rule:

We repeatedly update all weights by taking these steps until the cost function converges.

After a brief review of the single-layer network, we can then apply it as the second solution to our digit classification project.

Again, we use the nnet package to implement our single-layer network:

> model_nn <- nnet(label ~ ., data=data_train, size=50, maxit=300, MaxNWts=100000, decay=1e-4)
# weights: 39760
initial value 108597.598656
iter 10 value 27708.286001
iter 20 value 16027.005297
iter 30 value 14058.141050
iter 40 value 12615.442747
iter 50 value 11793.700937
iter 60 value 11026.672273
iter 70 value 10654.855058
iter 80 value 10193.580947
iter 90 value 9854.836168
iter 100 value 9544.973159
iter 110 value 9307.192737
iter 120 value 9043.028253
iter 130 value 8845.069307
iter 140 value 8686.707561
iter 150 value 8525.104362
iter 160 value 8281.609223
iter 170 value 8140.051273
iter 180 value 7998.721024
iter 190 value 7854.388240
iter 200 value 7712.459027
iter 210 value 7636.945553
iter 220 value 7557.675909
iter 230 value 7449.854506
iter 240 value 7355.021651
iter 250 value 7259.186906
iter 260 value 7192.798089
iter 270 value 7055.027833
iter 280 value 6957.926522
iter 290 value 6866.641511
iter 300 value 6778.342997
final value 6778.342997
stopped after 300 iterations

We fit the model with parameters including:

size=50: There are 50 hidden units in the hidden layer.
MaxNWts=100000: It allows at most 100,000 weights. In our case, there are (784 input dimensions + 1 bias) * 50 hidden units = 39,250 elements in the weight matrix w⁽¹⁾ and (50 hidden units + 1 bias) * 10 output units = 510 elements in the weight matrix w⁽¹⁾. So there are 39,760 weights in total.
decay=1e-4: The regularization strength, the weight decay is 0.0001.
maxit=300: The maximum number of iterations is set to be 300.

We apply the trained network model on the testing set:

> prediction_nn <- predict(model_nn, data_test, type = "class")
> cm_nn = table(data_test$label, prediction_nn)
> cm_nn
prediction_nn
0 1 2 3 4 5 6 7 8 9
0 987 0 3 3 2 11 10 7 5 5
1 0 1134 9 6 0 2 0 5 11 4
2 14 9 918 31 11 11 7 15 22 6
3 3 1 17 966 0 41 3 14 24 18
4 4 3 8 2 929 4 11 11 5 41
5 12 2 6 17 5 851 15 9 26 5
6 10 1 14 0 9 14 970 0 15 1
7 4 4 23 6 5 2 0 1010 7 39
8 5 15 9 18 5 31 9 4 912 7
9 11 2 1 20 52 4 0 36 8 913
> accuracy_nn = mean(prediction_nn == data_test$label)
> accuracy_nn
[1] 0.9135944

Better than our first attempt! Can we do better with deep learning models, say intuitively more hidden layers? Sure.

Adding more hidden layers to the networks

We have just achieved 91.3% accuracy with a single-layer neural network model. Theoretically, we can obtain a better one with more than one hidden layer. As an example, we provide a solution of a deep neural network model with two hidden layers:

Weight optimization in feed-forward deep neural networks is also realized through the backpropagation algorithm, which is identical to single-layer networks. However, the more layers, the higher the computation complexity, and the slower the model convergence. One way to accelerate the weight optimization, is to use a more computational efficient activation function. The most popular one in recent years is the rectified linear unit (ReLU):

A plot of the ReLU function is as follows:

Thanks to the properties of its derivative, which is , there are two main advantages of using the ReLU activation function over sigmoid:

Faster learning because of constant value of relu'(z), compared to that of the logistic function .
Less likely to have the vanishing gradient problem, exponential decrease of gradient, which can be found in networks with multiple stacked sigmoid layers. As we multiply the derivative of the activation function when calculating errors δ for each layer, and the maximal value of sigmoid'(z) is ¼, the gradients will decrease exponentially as we stack more and more sigmoid layers.

The nnet package we used in previous sections is (by now) only capable of modeling a single-layer network. In this chapter, we use the MXNet package to implement deep neural networks with multiple hidden layers. MXNet (https://mxnet.incubator.apache.org/) is a deep learning framework that supports programming languages include R, Scala, Python, Julia, C++, and Perl. It is developed by the DMLC (http://dmlc.ml/) team, a group of experts collaborating on open-source machine learning projects. It is portable and can scale to multiple CPUs, multiple GPUs and multiple machines, for example, in the cloud. Most importantly, it allows us to flexibly and efficiently construct state-of-the-art deep learning models, including deep neural networks, CNNs and RNNs.

Let's install MXNet first:

> cran <- getOption("repos")
> cran["dmlc"] <- "https://s3-us-west-2.amazonaws.com/apache-mxnet/R/CRAN/"
> options(repos = cran)
> if (!require("mxnet"))
install.packages("mxnet")

Now we can import MXNet and convert the data into the format preferred by the neural network models in MXNet:

> require(mxnet)
> data_train <- data.matrix(data_train)
> data_train.x <- data_train[,-1]
> data_train.x <- t(data_train.x/255)
> data_train.y <- data_train[,1]

Note we scale the input features to a range from 0 to 1, by dividing the maximal possible value 255. Otherwise, the deep neural networks may be skewed towards some features and such skewness will accumulate over layers.

Now that the training dataset is ready, we can start constructing the network by defining its architecture as follows:

> data <- mx.symbol.Variable("data")
> fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
> act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
> fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
> act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
> fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
> softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")

In the MXNet's Symbol API, we represent the network in the data type symbol. We begin with the input layer data, the input data, and follow up with the first hidden layer fc1 with 128 nodes, which fully connects with the input layer. We then attach the ReLU function to fc1 and output the activations act1 for this layer. Similarly, we chain another hidden layer fc2, with 64 nodes this time, and output ReLU-based activates act2. Finally, we end up with the output layer with a softmax function, generating 10 probabilities corresponding to 10 classes. The overall structure looks like this:

After building the bone, it is time to train the model. We can choose our computation device, CPU and/or GPU—here is a CPU example:

> devices <- mx.cpu()

Before training, don't forget to set the random seed to make the modeling process reproducible:

> mx.set.seed(42)
> model_dnn <- mx.model.FeedForward.create(softmax, X=data_train.x,
y=data_train.y, ctx=devices, num.round=30, array.batch.size=100,
learning.rate=0.01, momentum=0.9, eval.metric=mx.metric.accuracy,
initializer=mx.init.uniform(0.1),
epoch.end.callback=mx.callback.log.train.metric(100))
Start training with 1 devices
[1] Train-accuracy=0.724793650793651
[2] Train-accuracy=0.904715189873417
[3] Train-accuracy=0.925537974683544
[4] Train-accuracy=0.939936708860759
[5] Train-accuracy=0.950379746835443
[6] Train-accuracy=0.95873417721519
[7] Train-accuracy=0.96509493670886
[8] Train-accuracy=0.969905063291139
[9] Train-accuracy=0.974303797468355
[10] Train-accuracy=0.977784810126584
[11] Train-accuracy=0.980696202531648
[12] Train-accuracy=0.983164556962027
[13] Train-accuracy=0.985284810126584
[14] Train-accuracy=0.987405063291141
[15] Train-accuracy=0.988924050632913
[16] Train-accuracy=0.990727848101267
[17] Train-accuracy=0.992088607594938
[18] Train-accuracy=0.993227848101268
[19] Train-accuracy=0.994398734177217
[20] Train-accuracy=0.995284810126584
[21] Train-accuracy=0.995854430379748
[22] Train-accuracy=0.996835443037975
[23] Train-accuracy=0.997183544303798
[24] Train-accuracy=0.997848101265823
[25] Train-accuracy=0.998164556962026
[26] Train-accuracy=0.998575949367089
[27] Train-accuracy=0.998924050632912
[28] Train-accuracy=0.999177215189874
[29] Train-accuracy=0.999367088607595
[30] Train-accuracy=0.999525316455696

We just fit the model with hyperparameters including:

num.round = 30: The maximum number of iterations is set to be 30.
array.batch.size = 100: The batch size of the mini-batch gradient descent is 100. As a variation of a stochastic gradient descent, the mini-batch gradient descent algorithm calculates costs and gradients by small batches, instead of individual training samples. Hence, it is computationally more efficient and allows faster model convergence. As a result, the mini-batch gradient descent is more commonly used in training deep neural networks.
learning.rate = 0.01: The learning rate is 0.01.
momentum=0.9: In general, the cost function of deep architectures has the form of one or more shallow ravines (local minima) leading to the global optimum. Momentum as seen in the physical law of motion is employed to avoid getting stuck in sub-optimum and make the convergence faster. With momentum, weights are updated as follows:

where the left and right v is the previous and current velocity respectively, and γ ∈ (0,1] is the momentum factor determining how much of the previous velocity is incorporated into the current one.

eval.metric=mx.metric.accuracy: It uses classification accuracy as the evaluation metric
initializer=mx.init.uniform(0.1): Initial weights are randomly generated from the uniform distribution ranging from 0 to 1, so as to lower the chances of the weight exploding and vanishing in the deep network

After the model is trained, let's see how it performs on the testing set. First, remember to conduct the same pre-processing on the test dataset:

> data_test.x <- data_test[,-1]
> data_test.x <- t(data_test.x/255)

Then, predict the testing cases and evaluate the performance:

> prob_dnn <- predict(model_dnn, data_test.x)
> prediction_dnn <- max.col(t(prob_dnn)) - 1
> cm_dnn = table(data_test$label, prediction_dnn)
> cm_dnn
   prediction_dnn
       0    1    2    3    4    5    6    7    8    9
  0 1041    0    2    0    0    1    3    0    8    1
  1    0 1157    3    1    1    0    1    3    1    0
  2    2    1  993    3    3    1    2   13    5    2
  3    1    3   14 1033    1   13    0    5   14    6
  4    0    2    1    0  991    0    4    4    1   12
  5    4    2    3   12    3  892    4    3    6    8
  6   10    0    1    0    3    4  988    0    4    0
  7    0    5    9    1    2    0    0 1116    2    1
  8    4    8    3    5    0    8    3    2 1020   12
  9    1    1    0    4   13    3    0   16    2  957
> accuracy_dnn = mean(prediction_dnn == data_test$label)
> accuracy_dnn
[1] 0.9704706

By adding one more hidden layer, accuracy is improved from 91.4% to 97.0%! Since each hidden layer in a deep neural network provides representations of the data at a certain level, can we simply conclude that the more hidden layers (such as 100, 1,000, 10,000...), the more underneath patterns are discovered, the better the classification accuracy? It might be true if we have plentiful resources and time to enable computation and to make sure overfitting does not occur with such complex networks. Is there any way where we can extract richer and more informative representations than by simply chaining more hidden layers, and at the same time, not excessively grow our networks? The answer is CNNs.

Extracting richer representation with CNNs

Although regular hidden layers (we also call them fully connected layers) do the job of obtaining representations at certain levels, these representations might be able to help us differentiate between images of different classes. We need to extract richer and distinguishable representations that, for example, make a "9" a "9", a "4" a "4", or a cat a cat, a dog a dog. We resort to CNNs as variants of multi-layered neural networks which are biologically inspired by the human visual cortex. Basically, CNNs take inspiration from the following two neuroscience findings:

The visual cortex has a complex system of neuronal cells that are sensitive to specific sub-regions of the visual field, called the receptive field. For instance, some cells respond only in the presence of vertical edges; some cells fire only when exposed to horizontal edges; and some react more strongly when shown edges of a certain orientation. The cells are organized together to produce the entire visual perception, while each individual cell is specialized in a specific component.
Simple cells respond only when those edge-like patterns are presented within their receptive sub-regions. More complex cells are sensitive to larger sub-regions, and as a result, are less variant to the local position of those edge-like patterns in the entire visual field.

Similarly, CNNs classify images by first deriving low-level representations, local edges and curves, then by composing higher-level representations, overall shape and contour, through a series of low-level representations. CNNs are well suited to exploiting strong and unique features that differentiate between images.

In general, CNNs take in an image, pass it through a sequence of convolutional layers, non-linear layers, pooling layers and fully connected layers, and finally output the probabilities of each possible class. We now look at each type of layer individually, in detail.

The convolutional layer is the first layer in a CNN. It simulates the way neuronal cells respond to receptive fields by applying a convolutional operation to the input. To be specific, it computes the dot product between the weights of the convolutional layer and a small region they are connected to in the input layer. The small region is the receptive field, and the weights can be viewed as the values on a filter. As the filter slides from the beginning to the end of the input layer, the dot product between the weights and current receptive field is computed. A new layer called feature map is obtained after convolving over all sub-regions. Take a look at the following example:

Layer m has five nodes and the filter has three units [w₁, w₂, w₃]. We compute the dot product between the filter and the first three nodes in layer m and obtain the first node in the feature map; then, we compute the dot product between the filter and the middle three nodes and generate the second node; finally, we obtain the third node resulting from the last three nodes in layer m.

Another example that helps us better understand how the convolutional layer works on images is as follows:

A 3*3 filter is sliding around a 5*5 image from the top left sub-region to the bottom right. For every sub-region, the dot product is computed with the filter. A 3*3 feature map is generated as a result.

Convolutional layers are actually used to extract features, such as edges and curves. The output pixel in the feature map will be of high value, if the corresponding receptive field contains an edge or curve specified by the filter. For instance, in the preceding example, the filter portrays a backslash-shape diagonal edge, the receptive field in the blue rectangle contains a similar curve and hence, the highest intensity 3 (1*1 + 1* 1 + 1*1 = 3) is produced; however, the receptive field in the bottom left corner does not contain such a shape, which results in a pixel of value 1. The convolutional layer acts as a curve detector, mimicking the way our visual cells work.

Remember in the preceding case, we only applied one filter and generated one feature map, which indicates how well the shape in the input image resembles the curve represented in the filter. To achieve a richer representation of the data, we can employ more filters, such as horizontal, vertical curve, 30-degree, or right-angle shape, so that the hidden layer composed of feature maps can detect more patterns. Additionally, stacking many convolutional layers can produce higher-level representations, such as overall shape and contour. To ensure the strong spatially local patterns are caught, each filter in a layer is only responsive to the corresponding receptive fields. Chaining more layers results in larger receptive fields which capture more global patterns.

Right after each convolutional layer, we often apply a non-linear layer (also called activation layer, as we mentioned), in order to introduce non-linearity, obviously. This is because only linear operations (multiplication and addition) are conducted in the convolutional layer. And a neural network with only linear hidden layers would behave just like a single-layer perceptron, regardless of how many layers. Again, ReLu is the most popular candidate for the non-linear layer in deep neural networks.

Normally, after obtaining features via one or more pairs of convolutional layers and non-linear layers, we can use the output for classification, for example applying a softmax layer in our multiclass case. Let's do some math, suppose we apply 20 5*5 filters in the first convolutional layer, then the output of this layer will be of size 20 * (28 - 5 + 1) * (28 - 5 + 1) = 20 * 24 * 24 = 11520, which means the number of features as inputs for the next layer becomes 11,520 from 784; we then apply 50 5*5 filters in the second convolutional layer, the size of the output grows to 50 * 20 * (24 - 5 + 1) * (24 - 5 + 1) = 400,000, which is high-dimensional compared to our initial size of 784. We can see that the dimension increases dramatically with each convolutional layer before the softmax layer for classification. This can easily lead to overfitting, not to mention the large number of weights to be trained in the corresponding non-linear layer.

To address the dimension growth issue, we often employ a pooling layer (also called downsampling layer) after a pair of convolutional and non-linear layers. As its name implies, it aggregates statistics of features by sub-regions to generate much lower dimensional features. Typical pooling methods include max pooling and mean pooling, which take the max values and mean values over all non-overlapping sub-regions, respectively. For example, we apply a 2*2 max pooling filter on a 4*4 feature map and output a 2*2 one:

Besides reducing overfitting with lower dimensional output, the pooling layer aggregating statistics over regions has another advantage—translation invariant. It means the output does not change, if the input image undergoes a small amount of translation. For example, suppose we shift the input image to a couple of pixels left or right, up or down, as long as the highest pixels remain the same in sub-regions, the output of the max pooling layer is still the same. In another words, the prediction becomes less position-sensitive with pooling layers.

Putting these three types of convolutional-related layers together, along with fully connected layers, we can structure our CNN model as follows:

It starts with the first convolutional layer, ReLu non-linear layer and pooling layer. Here, we use 20 5*5 convolutional filters, and a 2*2 max pooling filter:

> # first convolution
> conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5),
num_filter=20)
> act1 <- mx.symbol.Activation(data=conv1, act_type="relu")
> pool1 <- mx.symbol.Pooling(data=act1, pool_type="max",
+ kernel=c(2,2), stride=c(2,2))

It follows with the second convolutional, ReLu non-linear and pooling layer, where 50 5*5 convolutional filters and a 2*2 max pooling filter are used:

> # second convolution
> conv2 <- mx.symbol.Convolution(data=pool1, kernel=c(5,5),
num_filter=50)
> act2 <- mx.symbol.Activation(data=conv2, act_type="relu")
> pool2 <- mx.symbol.Pooling(data=act2, pool_type="max",
+ kernel=c(2,2), stride=c(2,2))

Now that we extract rich representations of the input images by detecting edges, curves and shapes, we move on with the fully connected layers. But before doing so, we need to flatten the resulting feature maps from previous convolution layers:

> flatten <- mx.symbol.Flatten(data=pool2)

In the fully connected section, we apply two ReLu hidden layers with 500 and 10 units respectively:

> # first fully connected layer
> fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
> act3 <- mx.symbol.Activation(data=fc1, act_type="relu")
> # second fully connected layer
> fc2 <- mx.symbol.FullyConnected(data=act3, num_hidden=10)

Finally, the softmax layer producing outputs for 10 classes:

> # softmax output
> softmax <- mx.symbol.SoftmaxOutput(data=fc2, name="sm")

Now, the bone is constructed. Time to set a random seed and start training the model:

We need to first reshape the matrix, data_train.x into an array as required by the convolutional layer in MXNet:

> train.array <- data_train.x
> dim(train.array) <- c(28, 28, 1, ncol(data_train.x))
> mx.set.seed(42)
> model_cnn <- mx.model.FeedForward.create(softmax, X=train.array,
y=data_train.y, ctx=devices, num.round=30,
array.batch.size=100, learning.rate=0.05,
momentum=0.9, wd=0.00001,
eval.metric=mx.metric.accuracy,
epoch.end.callback=mx.callback.log.train.metric(100))
Start training with 1 devices
[1] Train-accuracy=0.306984126984127
[2] Train-accuracy=0.961898734177216
[3] Train-accuracy=0.981139240506331
[4] Train-accuracy=0.987151898734179
[5] Train-accuracy=0.990348101265825
[6] Train-accuracy=0.992689873417723
[7] Train-accuracy=0.994493670886077
[8] Train-accuracy=0.995822784810128
[9] Train-accuracy=0.995601265822786
[10] Train-accuracy=0.997246835443039
[11] Train-accuracy=0.997341772151899
[12] Train-accuracy=0.998006329113925
[13] Train-accuracy=0.997626582278482
[14] Train-accuracy=0.998069620253165
[15] Train-accuracy=0.998765822784811
[16] Train-accuracy=0.998449367088608
[17] Train-accuracy=0.998765822784811
[18] Train-accuracy=0.998955696202532
[19] Train-accuracy=0.999746835443038
[20] Train-accuracy=0.999841772151899
[21] Train-accuracy=0.999905063291139
[22] Train-accuracy=1
[23] Train-accuracy=1
[24] Train-accuracy=1
[25] Train-accuracy=1
[26] Train-accuracy=1
[27] Train-accuracy=1
[28] Train-accuracy=1
[29] Train-accuracy=1
[30] Train-accuracy=1

Besides those hyperparameters we used in the previous deep neural network model, we fit the CNN model with L2 regularization weight decay wd = 0.00001, which adds penalties for large weights in order to avoid overfitting.

Again, training of the CNN model is no different to other networks. Optimal weights are obtained through a backpropagation algorithm.

After the model is trained, let's see how it performs on the testing set. First, remember to conduct the same pre-processing on the test dataset:

> test.array <- data_test.x
> dim(test.array) <- c(28, 28, 1, ncol(data_test.x))

Predict the testing cases and evaluate the performance:

> prob_cnn <- predict(model_cnn, test.array)
> prediction_cnn <- max.col(t(prob_cnn)) - 1
> cm_cnn = table(data_test$label, prediction_cnn)
> cm_cnn
prediction_cnn
0 1 2 3 4 5 6 7 8 9
0 1051 0 1 0 0 1 1 0 2 0
1 0 1161 0 0 0 1 1 3 1 0
2 0 0 1014 4 0 0 0 7 0 0
3 0 0 2 1075 0 6 0 2 3 2
4 0 0 0 0 1000 0 4 2 2 7
5 1 0 0 4 0 923 3 0 3 3
6 3 0 0 0 0 0 1006 0 1 0
7 0 1 2 0 3 0 0 1129 1 0
8 3 3 1 1 2 5 1 0 1043 6
9 2 0 2 0 3 3 1 2 0 984
> accuracy_cnn = mean(prediction_cnn == data_test$label)
> accuracy_cnn
[1] 0.9893313

Our CNN model further boosts the accuracy to close to 99%!

We can also view the network structure by:

> graph.viz(model_cnn$symbol)

Let's do some more inspection to make sure we get things right. We start with the learning curving, for example the classification performance of the model on both the training set and testing set over the number of training iterations. In general, it is a good practice to plot the learning curve where we can visualize whether overfitting or underfitting issues occur:

> data_test.y <- data_test[,1]
> logger <- mx.metric.logger$new()
> model_cnn <- mx.model.FeedForward.create(softmax, X=train.array,
y=data_train.y,eval.data=list(data=test.array,
label=data_test.y), ctx=devices, num.round=30,
array.batch.size=100, learning.rate=0.05,
momentum=0.9, wd=0.00001,eval.metric=
mx.metric.accuracy, epoch.end.callback =
mx.callback.log.train.metric(1, logger))
Start training with 1 devices
[1] Train-accuracy=0.279936507936508
[1] Validation-accuracy=0.912857142857143
[2] Train-accuracy=0.959462025316456
[2] Validation-accuracy=0.973523809523809
[3] Train-accuracy=0.979841772151899
[3] Validation-accuracy=0.980666666666666
[4] Train-accuracy=0.986677215189875
[4] Validation-accuracy=0.983428571428571
[5] Train-accuracy=0.990822784810129
[5] Validation-accuracy=0.981809523809523
[6] Train-accuracy=0.992626582278482
[6] Validation-accuracy=0.983904761904761
[7] Train-accuracy=0.993322784810128
[7] Validation-accuracy=0.986
[8] Train-accuracy=0.995474683544305
[8] Validation-accuracy=0.987619047619047
[9] Train-accuracy=0.996487341772153
[9] Validation-accuracy=0.983904761904762
[10] Train-accuracy=0.995949367088608
[10] Validation-accuracy=0.984761904761904
[11] Train-accuracy=0.997310126582279
[11] Validation-accuracy=0.985142857142856
[12] Train-accuracy=0.997658227848102
[12] Validation-accuracy=0.986857142857142
[13] Train-accuracy=0.997848101265824
[13] Validation-accuracy=0.984095238095238
[14] Train-accuracy=0.998006329113924
[14] Validation-accuracy=0.985238095238094
[15] Train-accuracy=0.998607594936709
[15] Validation-accuracy=0.987619047619047
[16] Train-accuracy=0.99863924050633
[16] Validation-accuracy=0.987428571428571
[17] Train-accuracy=0.998987341772152
[17] Validation-accuracy=0.985142857142857
[18] Train-accuracy=0.998765822784811
[18] Validation-accuracy=0.986285714285713
[19] Train-accuracy=0.999240506329114
[19] Validation-accuracy=0.988761904761905
[20] Train-accuracy=0.999335443037975
[20] Validation-accuracy=0.98847619047619
[21] Train-accuracy=0.999841772151899
[21] Validation-accuracy=0.987809523809523
[22] Train-accuracy=0.99993670886076
[22] Validation-accuracy=0.990095238095237
[23] Train-accuracy=1
[23] Validation-accuracy=0.989999999999999
[24] Train-accuracy=1
[24] Validation-accuracy=0.989999999999999
[25] Train-accuracy=1
[25] Validation-accuracy=0.990190476190476
[26] Train-accuracy=1
[26] Validation-accuracy=0.990190476190476
[27] Train-accuracy=1
[27] Validation-accuracy=0.990095238095237
[28] Train-accuracy=1
[28] Validation-accuracy=0.990095238095237
[29] Train-accuracy=1
[29] Validation-accuracy=0.990095238095237
[30] Train-accuracy=1
[30] Validation-accuracy=0.990190476190475

We can get the performance on the training set after each round of training:

> logger$train
[1] 0.2799365 0.9594620 0.9798418 0.9866772 0.9908228 0.9926266 0.9933228 0.9954747 0.9964873 0.9959494 0.9973101
[12] 0.9976582 0.9978481 0.9980063 0.9986076 0.9986392 0.9989873 0.9987658 0.9992405 0.9993354 0.9998418 0.9999367
[23] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000

As well as the performance on the testing set after each round of training:

> logger$eval
[1] 0.9128571 0.9735238 0.9806667 0.9834286 0.9818095 0.9839048 0.9860000 0.9876190 0.9839048 0.9847619 0.9851429
[12] 0.9868571 0.9840952 0.9852381 0.9876190 0.9874286 0.9851429 0.9862857 0.9887619 0.9884762 0.9878095 0.9900952
[23] 0.9900000 0.9900000 0.9901905 0.9901905 0.9900952 0.9900952 0.9900952 0.9901905

The learning curve can be visualized by the following codes:

> plot(logger$train,type="l",col="red", ann=FALSE)
> lines(logger$eval,type="l", col="blue")
> title(main="Learning curve")
> title(xlab="Iterations")
> title(ylab="Accuary")
> legend(20, 0.5, c("training","testing"), cex=0.8,
col=c("red","blue"), pch=21:22, lty=1:2);

And we will get:

The learning curve indicates little chance of overfitting or underfitting.

Since the model works great, why don't we visualize the output of the convolutional layers of the trained model so that we can get a better understanding of CNNs. Let's use the first two samples in the testing set as an example. They are 1 and 0:

> par(mfrow=c(1,2))
> test_1 <- matrix(as.numeric(data_test[1,-1]), nrow = 28,
byrow = TRUE)
> image(rotate(test_1), col = grey.colors(255))
> test_2 <- matrix(as.numeric(data_test[2,-1]), nrow = 28,
byrow = TRUE)
> image(rotate(test_2), col = grey.colors(255))

For our reference, they are displayed as:

To visualize the activation of the convolutional layers, we first create an executor (can be loosely viewed as a copy of our trained CNN model) by grouping all of the layers with activations:

> layerss_for_viz <- mx.symbol.Group(mx.symbol.Group(c(conv1, act1, pool1, conv2, act2, pool2, fc1, fc2)))
> executor <- mx.simple.bind(symbol=layerss_for_viz,
data=dim(test.array), ctx=mx.cpu())

Now, update the weights in the executor with those in the trained model:

> mx.exec.update.arg.arrays(executor, model_cnn$arg.params, match.name=TRUE)
> mx.exec.update.aux.arrays(executor, model_cnn$aux.params,
match.name=TRUE)

And apply the executor on the testing set by making a feed-forward pass:

> mx.exec.update.arg.arrays(executor,
list(data=mx.nd.array(test.array)), match.name=TRUE)
> mx.exec.forward(executor, is.train=FALSE)

We can see the names of the layers recorded in the executor, as we will be extracting the activation of a layer by its name (note the names can be different, we should use the corresponding ones):

> names(executor$ref.outputs)
[1] "convolution10_output" "activation15_output"
"pooling10_output" "convolution11_output"
[5] "activation16_output" "pooling11_output"
"fullyconnected10_output" "fullyconnected11_output"

Now, we can visualize the activations for the first and second convolutional layer and ReLu layer, as well as the first pooling layer.

Let's start with the ReLu activations for the first 16 filters in the first convolutional layer, which are called activation15_output in our case (again, the name may vary). For the first sample, (a "1"), we run the following scripts:

> par(mfrow=c(4,4), mar=c(0.1,0.1,0.1,0.1))
> for (i in 1:16) {
+ outputData <- as.array
(executor$ref.outputs$activation15_output)[,,i,1]
+ image(outputData, xaxt='n', yaxt='n',
col=grey.colors(255)
+ )
+ }

Similarly, for the second sample, (a "0"), we run:

> par(mfrow=c(4,4), mar=c(0.1,0.1,0.1,0.1))
> for (i in 1:16) {
+ outputData <- as.array
(executor$ref.outputs$activation15_output)[,,i,2]
+ image(outputData, xaxt='n', yaxt='n',
col=grey.colors(255)
+ )
+ }

We plot the activations of the first convolutional layer for a 1 (left) and a 0 (right) input image, respectively:

We can observe that each feature map effectively extracts the edges, curves and strikes of the digits.

We continue with the corresponding outputs of the first pooling layer called pooling10_output:

> par(mfrow=c(4,4), mar=c(0.1,0.1,0.1,0.1))
> for (i in 1:16) {
+ outputData <-as.array
(executor$ref.outputs$pooling10_output)[,,i,1]
+ image(outputData, xaxt='n', yaxt='n',
col=grey.colors(255)
+ )
+ }
> par(mfrow=c(4,4), mar=c(0.1,0.1,0.1,0.1))
> for (i in 1:16) {
+ outputData <- as.array
(executor$ref.outputs$pooling10_output)[,,i,2]
+ image(outputData, xaxt='n', yaxt='n',
col=grey.colors(255)
+ )
+ }

We plot the outputs of the first max pooling layer:

As we can easily tell, they are the downsampled versions of the convolution outputs.

Finally, we visualize one more layer, the second convolutional layer, which is labeled as convolution11_output. Take the first 16 feature maps as an example:

> par(mfrow=c(4,4), mar=c(0.1,0.1,0.1,0.1))
> for (i in 1:16) {
+ outputData <- as.array
(executor$ref.outputs$convolution11_output)[,,i,1]
+ image(outputData, xaxt='n', yaxt='n',
col=grey.colors(255)
+ )
+ }
> par(mfrow=c(4,4), mar=c(0.1,0.1,0.1,0.1))
> for (i in 1:16) {
+ outputData <- as.array
(executor$ref.outputs$convolution11_output)[,,i,2]
+ image(outputData, xaxt='n', yaxt='n',
col=grey.colors(255)
+ )
+ }

We plot the outputs of the second convolutional layer for two images, respectively:

The second convolutional layer extracts higher-level features of the input images, such as shape and contour. All these 50 feature maps combined together, provide rich representations which are then fed into the fully connected layers. Therefore, classification performance is increased thanks to the convolution operations.

So far, we have examined the effectiveness of our CNN model by plotting the learning curve and visualizing the convolution extracted features. We can do one more inspection on the generalization of our model. For instance, we can take out a subset of data from the testing set to simulate a validation set, and use the remainder as the new testing set. We perform early stopping based on the validation set, which provides instruction on how many iterations is good enough and does not overfit the model. Finally, we employ the trained model with early stopping on the testing set and see how well the model is able to generalize to unseen data.

We first split the testing set into validation (40%) and final testing set (60%):

> validation_perc = 0.4
> validation_index <- createDataPartition(data_test.y, p=validation_perc, list=FALSE)
>
> validation.array <- test.array[, , , validation_index]
> dim(validation.array) <- c(28, 28, 1,
length(validation.array[1,1,]))
> data_validation.y <- data_test.y[validation_index]
> final_test.array <- test.array[, , , -validation_index]
> dim(final_test.array) <- c(28, 28, 1,
length(final_test.array[1,1,]))
> data_final_test.y <- data_test.y[-validation_index]

To conduct early stopping, we write our custom callback function and will assign it to the parameter, epoch.end.callback. The callback function checks the classification performance on the validation set, and if it is greater than the threshold we set, the training stops:

> mx.callback.early.stop <- function(eval.metric) {
+ function(iteration, nbatch, env, verbose) {
+ if (!is.null(env$metric)) {
+ if (!is.null(eval.metric)) {
+ result <- env$metric$get(env$eval.metric)
+ if (result$value >= eval.metric) {
+ return(FALSE)
+ }
+ }
+ }
+ return(TRUE)
+ }
+ }

Now, we train the CNN model with early stopping based on the validation set where we set the stopping criteria as a validation accuracy greater than 0.985:

> model_cnn_earlystop <- mx.model.FeedForward.create(softmax,
X=train.array, y=data_train.y,
eval.data=list(data=validation.array, label=data_validation.y),
+ ctx=devices, num.round=30, array.batch.size=100,
+ learning.rate=0.05, momentum=0.9, wd=0.00001,
eval.metric=mx.metric.accuracy,
+ epoch.end.callback = mx.callback.early.stop(0.985))
Start training with 1 devices
[1] Train-accuracy=0.284571428571429
[1] Validation-accuracy=0.921395348837209
[2] Train-accuracy=0.959145569620254
[2] Validation-accuracy=0.972325581395349
[3] Train-accuracy=0.980221518987343
[3] Validation-accuracy=0.97906976744186
[4] Train-accuracy=0.986613924050634
[4] Validation-accuracy=0.982790697674419
[5] Train-accuracy=0.990537974683546
[5] Validation-accuracy=0.981627906976744
[6] Train-accuracy=0.992848101265824
[6] Validation-accuracy=0.985348837209302

Training stops after the sixth iteration as the criteria is met. Finally, the performance on the new testing set is examined:

> prob_cnn <- predict(model_cnn_earlystop, final_test.array)
> prediction_cnn <- max.col(t(prob_cnn)) - 1
> cm_cnn = table(data_final_test.y, prediction_cnn)
> cm_cnn
prediction_cnn
data_final_test.y 0 1 2 3 4 5 6 7 8 9
0 626 0 0 0 0 0 0 0 1 0
1 0 701 1 0 0 2 1 3 4 0
2 1 0 598 4 0 0 0 6 0 0
3 0 0 0 658 0 5 0 0 2 1
4 0 0 0 0 585 1 3 3 1 4
5 1 0 1 3 0 558 5 0 0 2
6 4 0 0 0 0 0 595 0 0 0
7 0 1 3 0 1 0 0 675 0 0
8 4 0 1 1 1 7 3 0 621 2
9 1 0 0 0 2 1 0 4 0 589
> accuracy_cnn = mean(prediction_cnn == data_final_test.y)
> accuracy_cnn
[1] 0.9855487