Developing LeNet from scratch

LeNet, originally known as LeNet-5, is one of the earliest CNN models, developed in 1998. The number 5 in LeNet-5 represents the total number of layers in this model, that is, two convolutional and three fully connected layers. With roughly 60,000 total parameters, this model gave state-of-the-art performance on handwritten digit recognition at the time. As expected from a CNN model, LeNet demonstrated rotation, position, and scale invariance, as well as robustness against distortion in images. Unlike the classical machine learning models of the time, such as SVMs, which treated each pixel of the image separately, LeNet exploited the correlation among neighboring pixels.

Note that although LeNet was developed for handwritten digit recognition, it can certainly be extended for other image classification tasks, as we shall see in our next exercise. The following diagram shows the architecture of a LeNet model:

Figure 2.6: LeNet architecture

As mentioned earlier, there are two convolutional layers followed by three fully connected layers (including the output layer). This approach of stacking convolutional layers followed by fully connected layers later became a common practice in CNN research and is still applied to the latest CNN models.

This is because, by the time we reach the final convolutional layer, the output has small spatial dimensions (height and width) but a high depth, which makes the output look like an embedding of the input image. This embedding can be flattened into a vector and fed into a stack of fully connected layers. Besides these layers, there are pooling layers in between. These are subsampling layers that reduce the spatial size of the image representation, thereby reducing the number of parameters and computations as well as effectively condensing the input information. The pooling layer used in LeNet was an average pooling layer with trainable weights. Soon after, max pooling emerged as the most commonly used pooling function in CNNs.

The numbers in brackets for each layer in the figure indicate the dimensions (for the input, output, and fully connected layers) or the window size (for the convolutional and pooling layers). The expected input is a 32x32-pixel grayscale image. This image is then operated on by 5x5 convolutional kernels, followed by 2x2 pooling, and so on. The output layer has a size of 10, representing the 10 classes.
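
To make these numbers concrete, here is a minimal shape trace (an illustrative sketch using throwaway layers, not the model we build next) that follows a single-channel 32x32 input through the convolution and pooling stages:

import torch
import torch.nn as nn
import torch.nn.functional as F
# Illustrative shape trace for the original single-channel LeNet input
x = torch.randn(1, 1, 32, 32)                 # (batch, channels, height, width)
x = F.max_pool2d(nn.Conv2d(1, 6, 5)(x), 2)    # 32x32 -> 28x28 (conv) -> 14x14 (pool), 6 maps
print(x.shape)                                # torch.Size([1, 6, 14, 14])
x = F.max_pool2d(nn.Conv2d(6, 16, 5)(x), 2)   # 14x14 -> 10x10 (conv) -> 5x5 (pool), 16 maps
print(x.shape)                                # torch.Size([1, 16, 5, 5])
print(x.flatten(1).shape)                     # torch.Size([1, 400]), fed to the 120-unit layer

The model built in the following exercise uses the same layout but accepts three-channel (RGB) inputs.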

In this section, we will use PyTorch to build LeNet from scratch and train and evaluate it on a dataset of images for the task of image classification. We will see how easy and intuitive it is to build the network architecture in PyTorch using the outline from Figure 2.6.

Furthermore, we will demonstrate how effective LeNet is, even on a dataset different from the one it was originally developed on (that is, MNIST), and how PyTorch makes it easy to train and test the model in a few lines of code.

Using PyTorch to build LeNet

Observe the following steps to build the model:

  1. For this exercise, we will need to import a few dependencies. Execute the following import statements:
    import numpy as np
    import matplotlib.pyplot as plt
    import torch
    import torchvision
    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision.transforms as transforms
    torch.use_deterministic_algorithms(True)
    

Besides the usual imports, we also invoke the use_deterministic_algorithms function to ensure the reproducibility of this exercise.
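
Note that use_deterministic_algorithms only forces PyTorch to pick deterministic kernel implementations; for end-to-end reproducibility, you would typically also seed the random number generators. This is an optional addition, not part of the original exercise code:

# Optional: seed the RNGs as well for end-to-end reproducibility
import random
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)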

  2. Next, we will define the model architecture based on the outline given in Figure 2.6:
    class LeNet(nn.Module):
        def __init__(self):
            super(LeNet, self).__init__()
            # 3 input image channels, 6 output
            # feature maps, and a 5x5 conv kernel
            self.cn1 = nn.Conv2d(3, 6, 5)
            # 6 input channels, 16 output
            # feature maps, and a 5x5 conv kernel
            self.cn2 = nn.Conv2d(6, 16, 5)
            # fully connected layers of size 120, 84 and 10
            # 5*5 is the spatial dimension at this layer
            self.fc1 = nn.Linear(16 * 5 * 5, 120) 
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)
        def forward(self, x):
            # Convolution with 5x5 kernel
            x = F.relu(self.cn1(x))
            # Max pooling over a (2, 2) window
            x = F.max_pool2d(x, (2, 2))
            # Convolution with 5x5 kernel
            x = F.relu(self.cn2(x))
            # Max pooling over a (2, 2) window
            x = F.max_pool2d(x, (2, 2))
            # Flatten spatial and depth dimensions 
            # into a single vector
            x = x.view(-1, self.flattened_features(x))
            # Fully connected operations
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x
        def flattened_features(self, x):
            # all except the first (batch) dimension
            size = x.size()[1:]  
            num_feats = 1
            for s in size:
                num_feats *= s
            return num_feats
    lenet = LeNet()
    print(lenet)
    

In the last two lines, we instantiate the model and print the network architecture. The output will be as follows:

LeNet(
  (cn1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
  (cn2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

There are the usual __init__ and forward methods for architecture definition and running a forward pass, respectively. The additional flattened_features method is meant to calculate the total number of features in an image representation layer (usually an output of a convolutional layer or pooling layer). This method helps to flatten the spatial representation of features into a single vector of numbers, which is then used as input to fully connected layers.
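
As an aside, the same flattening can also be written with the built-in torch.flatten function (or an nn.Flatten module), which removes the need for the helper method; a minimal equivalent of the view call inside forward would be:

# Equivalent to x.view(-1, self.flattened_features(x)):
# keep dimension 0 (the batch) and flatten the remaining dimensions
x = torch.flatten(x, start_dim=1)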

Besides the details of the architecture mentioned earlier, ReLU is used throughout the network as the activation function. Also, unlike the original LeNet network, which takes in single-channel images, the current model is modified to accept RGB images, that is, three channels, as input. This is done in order to adapt to the dataset that is used for this exercise.
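
As a quick sanity check on the parameter count quoted at the start of this section, we can sum the sizes of all trainable tensors of the instantiated model; for this three-channel variant, the total comes out to roughly 62,000, slightly more than the original single-channel LeNet-5:

# Count the trainable parameters of the instantiated model
num_params = sum(p.numel() for p in lenet.parameters() if p.requires_grad)
print('Total trainable parameters:', num_params)  # roughly 62,000 for this variant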

  3. We then define the training routine, that is, the actual backpropagation step:
    def train(net, trainloader, optim, epoch):
        # initialize loss
        loss_total = 0.0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            # ip refers to the input images, and ground_truth 
            # refers to the output classes the images belong to
            ip, ground_truth = data
            # zero the parameter gradients
            optim.zero_grad()
            # forward pass + backward pass + optimization step
            op = net(ip)
            loss = nn.CrossEntropyLoss()(op, ground_truth)
            loss.backward()
            optim.step()
            # update loss
            loss_total += loss.item()
            # print loss statistics
            if (i+1) % 1000 == 0:
                # print at the interval of 1000 mini-batches
                print('[Epoch number : %d, Mini-batches: %5d] '
                      'loss: %.3f' % (epoch + 1, i + 1,
                                      loss_total / 200))
                loss_total = 0.0
    

For each epoch, this function iterates through the entire training dataset, runs a forward pass through the network, and updates the model parameters via backpropagation using the specified optimizer. Every 1,000 mini-batches, it also logs the running loss.

  4. Similar to the training routine, we will define the test routine that we will use to evaluate model performance:
    def test(net, testloader):
        success = 0
        counter = 0
        with torch.no_grad():
            for data in testloader:
                im, ground_truth = data
                op = net(im)
                _, pred = torch.max(op.data, 1)
                counter += ground_truth.size(0)
                success += (pred == ground_truth).sum().item()
        print('LeNet accuracy on 10000 images from test dataset: %d %%'\
            % (100 * success / counter))
    

This function runs a forward pass through the model for each test-set image, counts the number of correct predictions, and prints the percentage of correct predictions on the test set.

  5. Before we get on to training the model, we need to load the dataset. For this exercise, we will be using the CIFAR-10 dataset.

Dataset citation

The images in this section are from Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf. They are part of the CIFAR-10 dataset (toronto.edu): https://www.cs.toronto.edu/~kriz/cifar.html

This dataset consists of 60,000 32x32 RGB images labeled across 10 classes, with 6,000 images per class. The 60,000 images are split into 50,000 training images and 10,000 test images. More details can be found on the dataset website [2]. PyTorch provides the CIFAR10 dataset under the torchvision.datasets module, which we will use to load the data directly and instantiate the training and test dataloaders, as demonstrated in the following code:

# A mean and std of 0.5 are used for normalization;
# ToTensor first scales the pixel values to the range
# 0 to 1, and Normalize then maps them to -1 to 1
train_transform = transforms.Compose(
    [transforms.RandomHorizontalFlip(),
     transforms.RandomCrop(32, 4),
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), 
                          (0.5, 0.5, 0.5))])
trainset = torchvision.datasets.CIFAR10(root='./data', 
    train=True, download=True, transform=train_transform)
trainloader = torch.utils.data.DataLoader(trainset, 
    batch_size=8, shuffle=True)
test_transform = transforms.Compose([transforms.ToTensor(), 
    transforms.Normalize((0.5, 0.5, 0.5), 
                         (0.5, 0.5, 0.5))])
testset = torchvision.datasets.CIFAR10(root='./data', 
    train=False, download=True, transform=test_transform)
testloader = torch.utils.data.DataLoader(testset, 
    batch_size=10000, shuffle=False)
# ordering is important
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 
           'frog', 'horse', 'ship', 'truck')

In the next chapter, we will download the dataset and write a custom dataset class and a dataloader function. We will not need to write those here, thanks to the torchvision.datasets module.

Because we set the download flag to True, the dataset will be downloaded locally. Then, we shall see the following output:

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
100%
170498071/170498071 [00:34<00:00, 5191345.41it/s]
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified

The transformations used for training and testing datasets are different because we apply some data augmentation to the training dataset, such as flipping and cropping, which are not applicable to the test dataset. Also, after defining trainloader and testloader, we declare the 10 classes in this dataset with a pre-defined ordering.
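
As a quick optional check of the split described above, the sizes of the datasets and dataloaders can be inspected directly:

# Sanity check: 50,000 training images and 10,000 test images
print(len(trainset), len(testset))        # 50000 10000
# 6,250 training mini-batches of 8, and a single test batch of 10,000
print(len(trainloader), len(testloader))  # 6250 1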

  6. After loading the datasets, let’s investigate how the data looks:
    # define a function that displays an image
    def imageshow(image):
        # un-normalize the image
        image = image/2 + 0.5
        npimage = image.numpy()
        plt.imshow(np.transpose(npimage, (1, 2, 0)))
        plt.show()
    # sample images from training set
    dataiter = iter(trainloader)
    images, labels = next(dataiter)
    # display images in a grid
    num_images = 4
    imageshow(torchvision.utils.make_grid(images[:num_images]))
    # print labels
    print('    '+'  ||  '.join(classes[labels[j]] 
                               for j in range(num_images)))
    

The preceding code shows us four sample images with their respective labels from the training dataset. The output will be as follows:

Figure 2.7: CIFAR-10 dataset samples

The preceding output shows us four color images that are 32x32 pixels in size. These four images belong to four different labels, as displayed in the text following the images.

We will now train the LeNet model.

Training LeNet

Let us train the model with the help of the following steps:

  1. We will define the optimizer and start the training loop as shown here:
    # define optimizer
    optim = torch.optim.Adam(lenet.parameters(), lr=0.001)
    # training loop over the dataset multiple times
    for epoch in range(50):  
        train(lenet, trainloader, optim, epoch)
        print()
        test(lenet, testloader)
        print()
    print('Finished Training')
    

The output will be as follows:

[Epoch number : 1, Mini-batches:  1000] loss: 9.804
[Epoch number : 1, Mini-batches:  2000] loss: 8.783
[Epoch number : 1, Mini-batches:  3000] loss: 8.444
[Epoch number : 1, Mini-batches:  4000] loss: 8.118
[Epoch number : 1, Mini-batches:  5000] loss: 7.819
[Epoch number : 1, Mini-batches:  6000] loss: 7.672
LeNet accuracy on 10000 images from test dataset: 44 %
...
[Epoch number : 50, Mini-batches:  1000] loss: 5.022
[Epoch number : 50, Mini-batches:  2000] loss: 5.067
[Epoch number : 50, Mini-batches:  3000] loss: 5.137
[Epoch number : 50, Mini-batches:  4000] loss: 5.009
[Epoch number : 50, Mini-batches:  5000] loss: 5.107
[Epoch number : 50, Mini-batches:  6000] loss: 4.977
LeNet accuracy on 10000 images from test dataset: 67 %
Finished Training
  2. Once the training is finished, we can save the model file locally (an optional variant that also stores the optimizer state is sketched after this code):
    model_path = './cifar_model.pth'
    torch.save(lenet.state_dict(), model_path)
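
If you intend to resume training later rather than only run inference, a common variant is to save the optimizer state alongside the model weights; the following is a minimal sketch, with cifar_checkpoint.pth used purely as an example path:

# Optional: save model and optimizer state together so training can be resumed
checkpoint = {'model_state': lenet.state_dict(),
              'optim_state': optim.state_dict()}
torch.save(checkpoint, './cifar_checkpoint.pth')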
    

Having trained the LeNet model, we will now test its performance on the test dataset in the next section.

Testing LeNet

The following steps need to be followed to test the LeNet model:

  1. Let’s make predictions by loading the saved model and running it on the test dataset (a short note on evaluation mode follows this step):
    # load test dataset images
    d_iter = iter(testloader)
    im, ground_truth = next(d_iter)
    # print images and ground truth
    imageshow(torchvision.utils.make_grid(im[:4]))
    print('Label:      ', ' '.join('%5s' % 
                                   classes[ground_truth[j]] 
                                   for j in range(4)))
    # load model
    lenet_cached = LeNet()
    lenet_cached.load_state_dict(torch.load(model_path))
    # model inference
    op = lenet_cached(im)
    # print predictions
    _, pred = torch.max(op, 1)
    print('Prediction: ', ' '.join('%5s' % classes[pred[j]] 
                                   for j in range(4)))
    

The output will be as follows:

Figure 2.8: LeNet predictions

Evidently, three out of four predictions are correct.
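
One small habit worth adopting when loading a model for inference is to switch it to evaluation mode. It makes no difference for this LeNet, which has no dropout or batch normalization layers, but it matters for models that do. A minimal sketch of the inference call with this addition:

# Optional: evaluation mode and no_grad for inference
lenet_cached.eval()
with torch.no_grad():
    op = lenet_cached(im)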

  2. Finally, we will check the overall accuracy of this model on the test dataset, as well as the per-class accuracy:
    success = 0
    counter = 0
    with torch.no_grad():
        for data in testloader:
            im, ground_truth = data
            op = lenet_cached(im)
            _, pred = torch.max(op.data, 1)
            counter += ground_truth.size(0)
            success += (pred == ground_truth).sum().item()
    print('Model accuracy on 10000 images from test dataset: %d %%'
        % (100 * success / counter))
    

The output will be as follows:

Model accuracy on 10000 images from test dataset: 67 %
  3. For per-class accuracy, the code is as follows:
    class_success = list(0. for i in range(10))
    class_counter = list(0. for i in range(10))
    with torch.no_grad():
        for data in testloader:
            im, ground_truth = data
            op = lenet_cached(im)
            _, pred = torch.max(op, 1)
            c = (pred == ground_truth).squeeze()
            for i in range(10000):
                ground_truth_curr = ground_truth[i]
                class_success[ground_truth_curr] += c[i].item()
                class_counter[ground_truth_curr] += 1
    for i in range(10):
        print('Model accuracy for class %5s : %2d %%' % (
            classes[i], 100 * class_success[i] / class_counter[i]))
    

The output will be as follows:

Model accuracy for class plane : 70 %
Model accuracy for class   car : 83 % 
Model accuracy for class  bird : 45 %  
Model accuracy for class   cat : 37 %  
Model accuracy for class  deer : 80 %  
Model accuracy for class   dog : 52 %  
Model accuracy for class  frog : 81 %  
Model accuracy for class horse : 71 %  
Model accuracy for class  ship : 76 %  
Model accuracy for class truck : 74 %

Some classes have better performance than others. Overall, the model is far from perfect (that is, 100% accuracy) but much better than a model making random predictions, which would have an accuracy of 10% (due to the 10 classes).

Having built a LeNet model from scratch and evaluated its performance using PyTorch, we will now move on to a successor of LeNet – AlexNet. For LeNet, we built the model from scratch, trained, and tested it. For AlexNet, we will use a pretrained model, fine-tune it on a smaller dataset, and test it.
