Evolution of CNN architectures
CNNs have been in existence since 1989, when the first multi-layered CNN was developed by Yann LeCun. This model could perform the visual recognition task of identifying handwritten digits. In 1998, LeCun developed an improved ConvNet model called LeNet. Due to its high accuracy in optical character recognition tasks, LeNet was adopted for industrial use soon after its invention. Ever since, CNNs have been successful not only in academic research but also in practical industry use cases. The following diagram shows a brief timeline of architectural developments in the lifetime of CNNs, starting from 1989 all the way to 2020:
Figure 2.5: CNN architecture evolution – a broad picture
As we can see, there is a significant gap between the years 1998 and 2012. This was for two reasons:
- There wasn’t a dataset large and diverse enough to demonstrate the capabilities of CNNs, especially deep CNNs.
- The available computing power was limited.
Adding to the first reason, on the small datasets that existed at the time, such as MNIST, classical machine learning models such as SVMs were starting to beat the performance of CNNs.
Both limitations were alleviated in the transition from 1998 to 2012 and beyond. Firstly, there was exponential growth in digital data thanks to the advent of the internet and affordable devices such as digital cameras and smartphones. Secondly, there was an enormous increase in computational capabilities, including the arrival of general-purpose GPU computing.
These changes led to several CNN developments. The ReLU activation function was developed to deal with the exploding and vanishing gradient problems during backpropagation. Non-random initialization of network parameter values proved to be crucial. Max pooling was invented as an effective method of subsampling. GPUs became popular for training neural networks, especially CNNs, at scale.
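The following is a minimal PyTorch sketch (not taken from any specific model) that combines these ingredients: a convolutional layer followed by ReLU activation and max pooling, with weights initialized deliberately rather than left at the defaults. The layer sizes and the choice of Kaiming initialization are illustrative assumptions:

```python
import torch
import torch.nn as nn

# A small convolutional block: ReLU activation, max pooling for subsampling,
# and explicit (non-default) weight initialization. Sizes are illustrative.
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),                    # helps avoid vanishing gradients vs. sigmoid/tanh
    nn.MaxPool2d(kernel_size=2),  # subsamples feature maps by a factor of 2
)

# Initialize convolutional weights deliberately instead of relying on defaults
for m in conv_block.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

x = torch.randn(8, 1, 28, 28)     # a batch of MNIST-sized grayscale images
print(conv_block(x).shape)        # torch.Size([8, 16, 14, 14])
```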
Finally, and most importantly, a large-scale dedicated dataset of annotated images called ImageNet [1] was created by a research group at Stanford. It remains one of the primary benchmark datasets for CNN models to date.
With all of these developments compounding over the years, in 2012, a different architectural design brought about a massive improvement in CNN performance on the ImageNet dataset. This network was called AlexNet (named after its creator, Alex Krizhevsky). Along with various novel aspects such as random cropping and pretraining, AlexNet established the trend of uniform and modular convolutional layer design. This uniform and modular layer structure was taken forward by repeatedly stacking such modules (of convolutional layers), resulting in very deep CNNs known as VGGs.
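To make the idea of a uniform, repeatable module concrete, here is a hedged sketch of a VGG-style block in PyTorch. The helper name `vgg_style_block` and the channel counts are illustrative, not the actual VGG configuration:

```python
import torch
import torch.nn as nn

def vgg_style_block(in_channels, out_channels, num_convs):
    """A VGG-style module: a stack of 3x3 convolutions followed by max pooling."""
    layers = []
    for i in range(num_convs):
        layers.append(nn.Conv2d(in_channels if i == 0 else out_channels,
                                out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    layers.append(nn.MaxPool2d(kernel_size=2))
    return nn.Sequential(*layers)

# Repeatedly stacking such uniform modules yields a deep, VGG-like feature extractor
features = nn.Sequential(
    vgg_style_block(3, 64, 2),
    vgg_style_block(64, 128, 2),
    vgg_style_block(128, 256, 3),
)
x = torch.randn(1, 3, 224, 224)
print(features(x).shape)  # torch.Size([1, 256, 28, 28])
```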
Another approach, which branches the blocks/modules of convolutional layers and stacks these branched blocks on top of each other, proved extremely effective for tailored visual tasks. This network was called GoogLeNet (as it was developed at Google) or Inception v1 (inception being the term for the branched blocks). Several variants of the VGG and Inception networks followed, such as VGG16, VGG19, Inception v2, Inception v3, and so on.
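The sketch below illustrates the branching idea in a simplified form: parallel convolutional branches with different kernel sizes whose outputs are concatenated along the channel dimension. The class name `InceptionStyleBlock` and the per-branch channel counts are assumptions for illustration, not the exact GoogLeNet configuration:

```python
import torch
import torch.nn as nn

class InceptionStyleBlock(nn.Module):
    """A simplified Inception-style block: parallel branches with different
    receptive fields, concatenated along the channel dimension."""
    def __init__(self, in_channels):
        super().__init__()
        self.branch1x1 = nn.Conv2d(in_channels, 16, kernel_size=1)
        self.branch3x3 = nn.Conv2d(in_channels, 16, kernel_size=3, padding=1)
        self.branch5x5 = nn.Conv2d(in_channels, 16, kernel_size=5, padding=2)
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 16, kernel_size=1),
        )

    def forward(self, x):
        branches = [self.branch1x1(x), self.branch3x3(x),
                    self.branch5x5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)  # 4 branches x 16 = 64 output channels

block = InceptionStyleBlock(in_channels=32)
print(block(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```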
The next phase of development began with skip connections. To tackle the problem of vanishing gradients while training CNNs, non-consecutive layers were connected via skip connections so that information does not dissipate between them due to small gradients. It should be noted that skip connections are essentially a special case of the multi-path-based CNNs discussed earlier. A popular network that emerged with this trick, among other novel characteristics such as batch normalization, was ResNet.
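A minimal sketch of a residual block follows, showing how the skip connection adds the block's input back to the output of its convolutional path, giving gradients a short path during backpropagation. The class name and channel count are illustrative, not the exact ResNet building block:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A simplified ResNet-style block: the input is added back to the output
    of the convolutional path via a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: add the input back

block = ResidualBlock(channels=64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```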
A logical extension of ResNet was DenseNet, where layers were densely connected; that is, each layer receives as input the output feature maps of all the preceding layers. Hybrid architectures then followed, mixing successful past architectures, as in Inception-ResNet, and increasing the number of parallel branches within a block, as in ResNeXt.
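The dense connectivity pattern can be sketched as below: every layer operates on the concatenation of all earlier feature maps. The class name `DenseStyleBlock`, the growth rate, and the layer count are illustrative assumptions rather than the official DenseNet configuration:

```python
import torch
import torch.nn as nn

class DenseStyleBlock(nn.Module):
    """A simplified DenseNet-style block: each layer receives the concatenated
    feature maps of all preceding layers as input."""
    def __init__(self, in_channels, growth_rate=12, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
            ))
            channels += growth_rate  # the next layer's input grows by concatenation

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseStyleBlock(in_channels=16)
print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 52, 32, 32])
```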
Lately, the channel boosting technique has proven useful in improving CNN performance. The idea is to learn novel features while exploiting pre-learned features through transfer learning. Most recently, automatically designing new blocks and finding optimal CNN architectures has been a growing trend in CNN research. Examples of such CNNs are MnasNets and EfficientNets. The approach behind these models is to perform a neural architecture search to derive an optimal CNN architecture, combined with a uniform model-scaling approach.
In the next section, we will go back to one of the earliest CNN models and take a closer look at the various CNN architectures developed since. We will build these architectures using PyTorch and train some of the models on real-world datasets. We will also explore PyTorch's repository of pretrained CNN models, popularly known as the model zoo, and learn how to fine-tune these pretrained models as well as run predictions with them.