Developing LeNet from scratch
LeNet, originally known as LeNet-5, is one of the earliest CNN models, developed in 1998. The number 5 in LeNet-5 represents the total number of layers in this model, that is, two convolutional and three fully connected layers. With roughly 60,000 total parameters, this model gave state-of-the-art performance on handwritten digit recognition in 1998. As expected from a CNN model, LeNet demonstrated a degree of invariance to rotation, position, and scale, as well as robustness against distortion in images. Unlike the classical machine learning models of the time, such as SVMs, which treated each pixel of the image as a separate feature, LeNet exploited the correlation among neighboring pixels.
Note that although LeNet was developed for handwritten digit recognition, it can certainly be extended for other image classification tasks, as we shall see in our next exercise. The following diagram shows the architecture of a LeNet model:
Figure 2.6: LeNet architecture
As mentioned earlier, there are two convolutional layers followed by three fully connected layers (including the output layer). This approach of stacking convolutional layers followed by fully connected layers later became a common practice in CNN research and is still applied to the latest CNN models.
This is because, by the time we reach the final convolutional layer, the output has small spatial dimensions (length and width) but a high depth, which makes it look like an embedding of the input image. This embedding is like a vector that can be fed into a fully connected network, which is essentially a stack of fully connected layers. Besides these layers, there are pooling layers in between. These are subsampling layers that reduce the spatial size of the image representation, thereby reducing the number of parameters and computations as well as effectively condensing the input information. The pooling layer used in LeNet was an average pooling layer that had trainable weights. Soon after, max pooling emerged as the most commonly used pooling function in CNNs.
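To make the difference between the two pooling functions concrete, here is a minimal sketch in plain Python. The `pool2d` helper and the 4x4 `feature_map` are hypothetical names and values chosen for illustration, and the averaging shown is plain averaging without the trainable weights that the original LeNet pooling layer had:

```python
def pool2d(feature_map, window=2, reduce=max):
    """Apply non-overlapping window x window pooling to a 2D list."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - window + 1, window):
        pooled_row = []
        for c in range(0, cols - window + 1, window):
            # collect the values that fall inside the current window
            patch = [feature_map[r + i][c + j]
                     for i in range(window) for j in range(window)]
            pooled_row.append(reduce(patch))
        pooled.append(pooled_row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# hypothetical 4x4 feature map
feature_map = [
    [1, 3, 2, 0],
    [5, 7, 4, 6],
    [0, 2, 8, 8],
    [2, 4, 8, 8],
]

max_pooled = pool2d(feature_map, reduce=max)   # [[7, 6], [4, 8]]
avg_pooled = pool2d(feature_map, reduce=mean)  # [[4.0, 3.0], [2.0, 8.0]]
print(max_pooled)
print(avg_pooled)
```

In both cases, a 2x2 window halves each spatial dimension, which is the subsampling effect described above.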
The numbers in brackets in each layer in the figure indicate the dimensions (for input, output, and fully connected layers) or the window size (for convolutional and pooling layers). The expected input is a grayscale image of 32x32 pixels. This image is then operated on by 5x5 convolutional kernels, followed by 2x2 pooling, and so on. The output layer size is 10, representing the 10 classes.
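The spatial sizes implied by these window sizes can be traced with a little arithmetic. The following sketch assumes unpadded convolutions with stride 1 and non-overlapping pooling. The helper names `conv_out` and `pool_out` are ours, and the depth of 16 feature maps is the one used by the second convolutional layer of the model defined later in this section:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window):
    """Output size of non-overlapping pooling along one dimension."""
    return size // window


size = 32
trace = [("input", size)]
size = conv_out(size, 5)
trace.append(("conv 5x5", size))   # 28
size = pool_out(size, 2)
trace.append(("pool 2x2", size))   # 14
size = conv_out(size, 5)
trace.append(("conv 5x5", size))   # 10
size = pool_out(size, 2)
trace.append(("pool 2x2", size))   # 5

# 16 feature maps of 5x5 are flattened before the fully connected layers
flattened = 16 * size * size       # 400

for name, s in trace:
    print(f"{name:>8}: {s}x{s}")
print("flattened features:", flattened)
```

This is where the 16 * 5 * 5 input size of the first fully connected layer comes from.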
In this section, we will use PyTorch to build LeNet from scratch and train and evaluate it on a dataset of images for the task of image classification. We will see how easy and intuitive it is to build the network architecture in PyTorch using the outline from Figure 2.6.
Furthermore, we will demonstrate how effective LeNet is, even on a dataset different from the one it was originally developed on (MNIST), and how PyTorch makes it easy to train and test the model in a few lines of code.
Using PyTorch to build LeNet
Observe the following steps to build the model:
- For this exercise, we will need to import a few dependencies. Execute the following import statements:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
torch.use_deterministic_algorithms(True)
Besides the usual imports, we also invoke the use_deterministic_algorithms function to ensure the reproducibility of this exercise.
- Next, we will define the model architecture based on the outline given in Figure 2.6:
class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        # 3 input image channels, 6 output
        # feature maps and 5x5 conv kernel
        self.cn1 = nn.Conv2d(3, 6, 5)
        # 6 input image channels, 16 output
        # feature maps and 5x5 conv kernel
        self.cn2 = nn.Conv2d(6, 16, 5)
        # fully connected layers of size 120, 84 and 10
        # 5*5 is the spatial dimension at this layer
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Convolution with 5x5 kernel
        x = F.relu(self.cn1(x))
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(x, (2, 2))
        # Convolution with 5x5 kernel
        x = F.relu(self.cn2(x))
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(x, (2, 2))
        # Flatten spatial and depth dimensions
        # into a single vector
        x = x.view(-1, self.flattened_features(x))
        # Fully connected operations
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def flattened_features(self, x):
        # all except the first (batch) dimension
        size = x.size()[1:]
        num_feats = 1
        for s in size:
            num_feats *= s
        return num_feats

lenet = LeNet()
print(lenet)
In the last two lines, we instantiate the model and print the network architecture. The output will be as follows:
LeNet(
(cn1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
(cn2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=400, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
There are the usual __init__ and forward methods for architecture definition and running a forward pass, respectively. The additional flattened_features method is meant to calculate the total number of features in an image representation layer (usually an output of a convolutional layer or pooling layer). This method helps to flatten the spatial representation of features into a single vector of numbers, which is then used as input to fully connected layers.
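As a quick check of what flattened_features computes, the following sketch applies the same loop to a plain shape tuple instead of a tensor (an assumption made here so that the example runs without PyTorch) and compares it with `math.prod` from the standard library. The batch size of 8 is the one used by the training dataloader later in this section:

```python
import math


def flattened_features(shape):
    """Number of features per sample: product of all dims except batch."""
    num_feats = 1
    for s in shape[1:]:
        num_feats *= s
    return num_feats


# (batch, channels, height, width) after the second pooling layer
shape = (8, 16, 5, 5)

loop_result = flattened_features(shape)
prod_result = math.prod(shape[1:])
print(loop_result, prod_result)  # 400 400
```

When working with tensors directly, a call such as `torch.flatten(x, 1)` should achieve the same flattening in one step, although the model in this section keeps the explicit helper.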
Besides the details of the architecture mentioned earlier, ReLU is used throughout the network as the activation function, and max pooling is used in place of the trainable average pooling of the original LeNet. Also, unlike the original LeNet network, which takes in single-channel images, the current model is modified to accept RGB images, that is, three channels, as input. This is done in order to adapt to the dataset that is used for this exercise.
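The effect of the extra input channels on the model size can be worked out by hand. The sketch below counts weights and biases for the layers exactly as they are defined in this section. The helper names are ours, and the count describes this section's model, not necessarily the original 1998 network:

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus one bias per output feature map."""
    return in_channels * out_channels * kernel * kernel + out_channels


def linear_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return in_features * out_features + out_features


def lenet_params(in_channels):
    return (conv_params(in_channels, 6, 5)
            + conv_params(6, 16, 5)
            + linear_params(16 * 5 * 5, 120)
            + linear_params(120, 84)
            + linear_params(84, 10))


rgb_params = lenet_params(3)   # 62006
gray_params = lenet_params(1)  # 61706
print(rgb_params, gray_params)
```

Both counts are in line with the figure of roughly 60,000 parameters quoted earlier. Moving from one to three input channels adds only 300 weights, all in the first convolutional layer.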
- We then define the training routine, that is, the actual backpropagation step:
def train(net, trainloader, optim, epoch):
    # initialize loss
    loss_total = 0.0

    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        # ip refers to the input images, and ground_truth
        # refers to the output classes the images belong to
        ip, ground_truth = data

        # zero the parameter gradients
        optim.zero_grad()

        # forward-pass + backward-pass + optimization-step
        op = net(ip)
        loss = nn.CrossEntropyLoss()(op, ground_truth)
        loss.backward()
        optim.step()

        # update loss
        loss_total += loss.item()

        # print loss statistics
        if (i+1) % 1000 == 0:
            # print at the interval of 1000 mini-batches
            # note: the loss summed over 1000 mini-batches is divided
            # by 200, so the logged value is 5x the mean per-batch loss
            print('[Epoch number : %d, Mini-batches: %5d] loss: %.3f' %
                  (epoch + 1, i + 1, loss_total / 200))
            loss_total = 0.0
For each epoch, this function iterates through the entire training dataset, runs a forward pass through the network, and, using backpropagation, updates the parameters of the model based on the specified optimizer. After every 1,000 mini-batches, it also logs the accumulated loss. Note that the code divides the loss summed over 1,000 mini-batches by 200, so each logged value is five times the mean per-batch loss.
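Because the training routine sums the loss over 1,000 mini-batches but divides by 200, the logged number is not a per-batch mean. The following sketch shows how to convert between the two. The `loss_sum` value is hypothetical, and the chance-level figure assumes a classifier that assigns equal probability to each of 10 classes:

```python
import math


def logged_value(loss_sum, divisor=200):
    """Value printed by the training routine as written."""
    return loss_sum / divisor


def mean_loss(loss_sum, num_batches=1000):
    """Mean cross-entropy loss per mini-batch."""
    return loss_sum / num_batches


# hypothetical loss summed over 1000 mini-batches
loss_sum = 1961.0

logged = logged_value(loss_sum)   # about 9.805
mean = mean_loss(loss_sum)        # about 1.961
chance_loss = math.log(10)        # about 2.303 for 10 equally likely classes

print(f"logged: {logged:.3f}, mean per batch: {mean:.3f}, "
      f"chance level: {chance_loss:.3f}")
```

Dividing any logged value by 5 therefore gives the mean per-batch loss, which can then be compared against the chance level of about 2.303.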
- Similar to the training routine, we will define the test routine that we will use to evaluate model performance:
def test(net, testloader):
    success = 0
    counter = 0
    with torch.no_grad():
        for data in testloader:
            im, ground_truth = data
            op = net(im)
            _, pred = torch.max(op.data, 1)
            counter += ground_truth.size(0)
            success += (pred == ground_truth).sum().item()
    print('LeNet accuracy on 10000 images from test dataset: %d %%'
          % (100 * success / counter))
This function runs a forward pass through the model for each test-set image, calculates the correct number of predictions, and prints the percentage of correct predictions on the test set.
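The accuracy computation itself is simple: the predicted class is the index of the largest output score, and accuracy is the share of predictions that match the labels. Here is a minimal sketch in plain Python, using hypothetical scores for a three-class problem:

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(scores_batch, labels):
    """Percentage of samples whose highest-scoring class matches the label."""
    preds = [argmax(row) for row in scores_batch]
    correct = sum(int(p == t) for p, t in zip(preds, labels))
    return 100 * correct / len(labels)


# hypothetical model outputs for 4 samples and 3 classes
scores = [
    [0.1, 2.0, -1.0],   # predicted class 1
    [1.5, 0.2, 0.3],    # predicted class 0
    [0.0, 0.1, 0.9],    # predicted class 2
    [0.4, 0.3, 0.2],    # predicted class 0
]
labels = [1, 0, 2, 1]

acc = accuracy(scores, labels)
print(acc)  # 75.0
```

In the test routine, torch.max(op.data, 1) plays the role of `argmax` here, returning the index of the highest score for each image in the batch.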
- Before we get on to training the model, we need to load the dataset. For this exercise, we will be using the CIFAR-10 dataset.
Dataset citation
The images in this section are from Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf. They are part of the CIFAR-10 dataset (toronto.edu): https://www.cs.toronto.edu/~kriz/cifar.html
This dataset consists of 60,000 32x32 RGB images labeled across 10 classes, with 6,000 images per class. The 60,000 images are split into 50,000 training images and 10,000 test images. More details can be found at the dataset website [2]. The torchvision library provides the CIFAR10 dataset under the torchvision.datasets module. We will be using this module to directly load the data and instantiate train and test dataloaders, as demonstrated in the following code:
# The mean and std are kept as 0.5 for normalizing
# pixel values as the pixel values are originally
# in the range 0 to 1
train_transform = transforms.Compose(
    [transforms.RandomHorizontalFlip(),
     transforms.RandomCrop(32, 4),
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5),
                          (0.5, 0.5, 0.5))])
trainset = torchvision.datasets.CIFAR10(root='./data',
    train=True, download=True, transform=train_transform)
trainloader = torch.utils.data.DataLoader(trainset,
    batch_size=8, shuffle=True)

test_transform = transforms.Compose([transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5),
                         (0.5, 0.5, 0.5))])
testset = torchvision.datasets.CIFAR10(root='./data',
    train=False, download=True, transform=test_transform)
testloader = torch.utils.data.DataLoader(testset,
    batch_size=10000, shuffle=False)

# ordering is important
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog',
           'frog', 'horse', 'ship', 'truck')
In the next chapter, we will download the dataset and write a custom dataset class and a dataloader function. We will not need to write those here, thanks to the torchvision.datasets module.
Because we set the download flag to True, the dataset will be downloaded locally. Then, we shall see the following output:
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
100% 170498071/170498071 [00:34<00:00, 5191345.41it/s]
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
The transformations used for the training and test datasets are different because we apply some data augmentation to the training dataset, such as flipping and cropping, which is not applicable to the test dataset. Also, after defining trainloader and testloader, we declare the 10 classes in this dataset with a pre-defined ordering.
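The batch sizes chosen for the two dataloaders determine how many mini-batches each one yields. The sketch below works this out from the dataset sizes, assuming the dataloader keeps the last, possibly smaller, batch. The helper name `num_batches` is ours:

```python
import math


def num_batches(num_samples, batch_size):
    """Number of mini-batches when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


train_batches = num_batches(50_000, 8)       # 6250
test_batches = num_batches(10_000, 10_000)   # 1

# the training routine logs once every 1000 mini-batches
log_lines_per_epoch = train_batches // 1000  # 6

print(train_batches, test_batches, log_lines_per_epoch)
```

With these settings, the whole test set arrives as a single batch of 10,000 images, and each training epoch should produce six loss log lines.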
- After loading the datasets, let’s investigate how the data looks:
# define a function that displays an image
def imageshow(image):
    # un-normalize the image
    image = image/2 + 0.5
    npimage = image.numpy()
    plt.imshow(np.transpose(npimage, (1, 2, 0)))
    plt.show()

# sample images from training set
dataiter = iter(trainloader)
images, labels = next(dataiter)

# display images in a grid
num_images = 4
imageshow(torchvision.utils.make_grid(images[:num_images]))
# print labels
print(' '+' || '.join(classes[labels[j]] for j in range(num_images)))
The preceding code shows us four sample images with their respective labels from the training dataset. The output will be as follows:
Figure 2.7: CIFAR-10 dataset samples
The preceding output shows us four color images that are 32x32 pixels in size. These four images belong to four different labels, as displayed in the text following the images.
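The un-normalization step in imageshow is the inverse of the normalization applied in the transforms. With a mean and standard deviation of 0.5, normalization maps pixel values from the range 0 to 1 onto the range -1 to 1, and image/2 + 0.5 maps them back. Here is a minimal sketch on single pixel values, with helper names of our own choosing:

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Per-channel normalization: (pixel - mean) / std."""
    return (pixel - mean) / std


def unnormalize(value):
    """Inverse used by imageshow: value / 2 + 0.5."""
    return value / 2 + 0.5


pixels = [0.0, 0.25, 0.5, 1.0]
normalized = [normalize(p) for p in pixels]     # [-1.0, -0.5, 0.0, 1.0]
restored = [unnormalize(v) for v in normalized]  # back to the original values

print(normalized)
print(restored)
```

Without this step, plt.imshow would receive values outside the 0 to 1 range that it expects for floating-point image data.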
We will now train the LeNet model.
Training LeNet
Let us train the model with the help of the following steps:
- We will define the optimizer and start the training loop as shown here:
# define optimizer
optim = torch.optim.Adam(lenet.parameters(), lr=0.001)

# training loop over the dataset multiple times
for epoch in range(50):
    train(lenet, trainloader, optim, epoch)
    print()
    test(lenet, testloader)
    print()
print('Finished Training')
The output will be as follows:
[Epoch number : 1, Mini-batches: 1000] loss: 9.804
[Epoch number : 1, Mini-batches: 2000] loss: 8.783
[Epoch number : 1, Mini-batches: 3000] loss: 8.444
[Epoch number : 1, Mini-batches: 4000] loss: 8.118
[Epoch number : 1, Mini-batches: 5000] loss: 7.819
[Epoch number : 1, Mini-batches: 6000] loss: 7.672
LeNet accuracy on 10000 images from test dataset: 44 %
...
[Epoch number : 50, Mini-batches: 1000] loss: 5.022
[Epoch number : 50, Mini-batches: 2000] loss: 5.067
[Epoch number : 50, Mini-batches: 3000] loss: 5.137
[Epoch number : 50, Mini-batches: 4000] loss: 5.009
[Epoch number : 50, Mini-batches: 5000] loss: 5.107
[Epoch number : 50, Mini-batches: 6000] loss: 4.977
LeNet accuracy on 10000 images from test dataset: 67 %
Finished Training
- Once the training is finished, we can save the model file locally:
model_path = './cifar_model.pth'
torch.save(lenet.state_dict(), model_path)
Having trained the LeNet model, we will now test its performance on the test dataset in the next section.
Testing LeNet
The following steps need to be followed to test the LeNet model:
- Let’s make predictions by loading the saved model and running it on the test dataset:
# load test dataset images
d_iter = iter(testloader)
im, ground_truth = next(d_iter)

# print images and ground truth
imageshow(torchvision.utils.make_grid(im[:4]))
print('Label: ', ' '.join('%5s' % classes[ground_truth[j]] for j in range(4)))

# load model
lenet_cached = LeNet()
lenet_cached.load_state_dict(torch.load(model_path))

# model inference (no gradients are needed at test time)
with torch.no_grad():
    op = lenet_cached(im)

# print predictions
_, pred = torch.max(op, 1)
print('Prediction: ', ' '.join('%5s' % classes[pred[j]] for j in range(4)))
The output will be as follows:
Figure 2.8: LeNet predictions
Evidently, three out of four predictions are correct.
- Finally, we will check the overall accuracy of this model on the test dataset as well as the per-class accuracy:
success = 0
counter = 0
with torch.no_grad():
    for data in testloader:
        im, ground_truth = data
        op = lenet_cached(im)
        _, pred = torch.max(op.data, 1)
        counter += ground_truth.size(0)
        success += (pred == ground_truth).sum().item()
print('Model accuracy on 10000 images from test dataset: %d %%'
      % (100 * success / counter))
The output will be as follows:
Model accuracy on 10000 images from test dataset: 67 %
- For per-class accuracy, the code is as follows:
class_success = list(0. for i in range(10))
class_counter = list(0. for i in range(10))
with torch.no_grad():
    for data in testloader:
        im, ground_truth = data
        op = lenet_cached(im)
        _, pred = torch.max(op, 1)
        c = (pred == ground_truth).squeeze()
        # iterate over the images actually present in this batch
        for i in range(ground_truth.size(0)):
            ground_truth_curr = ground_truth[i].item()
            class_success[ground_truth_curr] += c[i].item()
            class_counter[ground_truth_curr] += 1

for i in range(10):
    print('Model accuracy for class %5s : %2d %%' % (
        classes[i], 100 * class_success[i] / class_counter[i]))
The output will be as follows:
Model accuracy for class plane : 70 %
Model accuracy for class car : 83 %
Model accuracy for class bird : 45 %
Model accuracy for class cat : 37 %
Model accuracy for class deer : 80 %
Model accuracy for class dog : 52 %
Model accuracy for class frog : 81 %
Model accuracy for class horse : 71 %
Model accuracy for class ship : 76 %
Model accuracy for class truck : 74 %
Some classes have better performance than others. Overall, the model is far from perfect (that is, 100% accuracy) but much better than a model making random predictions, which would have an accuracy of 10% (because the 10 classes are equally represented).
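As a sanity check, the per-class figures can be tied back to the overall accuracy. Because the CIFAR-10 test set contains the same number of images for every class, the overall accuracy equals the plain mean of the per-class accuracies. The sketch below uses the values printed above, which are truncated to whole percentages, so the mean is approximate:

```python
# per-class accuracies (in %) as printed in the output above
per_class = {
    'plane': 70, 'car': 83, 'bird': 45, 'cat': 37, 'deer': 80,
    'dog': 52, 'frog': 81, 'horse': 71, 'ship': 76, 'truck': 74,
}

# with equally sized classes, overall accuracy is the unweighted mean
mean_accuracy = sum(per_class.values()) / len(per_class)

# expected accuracy of uniformly random guessing over balanced classes
chance_accuracy = 100 / len(per_class)

best_class = max(per_class, key=per_class.get)
worst_class = min(per_class, key=per_class.get)

print(f"mean of per-class accuracies: {mean_accuracy:.1f} %")
print(f"chance level: {chance_accuracy:.1f} %")
print(f"best: {best_class}, worst: {worst_class}")
```

The mean of about 66.9% agrees with the reported overall accuracy of 67%, and the gap between the best class (car) and the worst class (cat) shows where the model struggles most.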
Having built a LeNet model from scratch and evaluated its performance using PyTorch, we will now move on to a successor of LeNet: AlexNet. For LeNet, we built the model from scratch, trained it, and tested it. For AlexNet, we will use a pretrained model, fine-tune it on a smaller dataset, and test it.