The YOLO architecture is inspired by GoogLeNet, an image classification model. The YOLO network consists of 24 convolutional layers followed by two fully connected layers. Alternating 1×1 convolutional layers reduce the feature space from the preceding layers.
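To make the role of the 1×1 layers concrete, the following is a minimal sketch in plain Python. It is not taken from the YOLO code base; the function name `conv1x1`, the tiny 2×2 feature map, and the weights are all hypothetical values chosen for illustration. A 1×1 convolution mixes the channels at each pixel independently, so it can shrink the channel depth without changing the spatial size:

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution (no bias).

    feature_map: nested lists of shape H x W x C_in
    weights:     nested lists of shape C_out x C_in
    returns:     nested lists of shape H x W x C_out
    """
    return [
        [
            # Each output channel is a weighted sum of the input channels
            # at this single pixel; neighbouring pixels are not involved.
            [sum(w * x for w, x in zip(kernel, pixel)) for kernel in weights]
            for pixel in row
        ]
        for row in feature_map
    ]


# Hypothetical 2x2 feature map with 4 channels per pixel.
feature_map = [
    [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]],
    [[2.0, 2.0, 2.0, 2.0], [1.0, 0.0, 1.0, 0.0]],
]

# Hypothetical weights that reduce 4 input channels to 2 output channels.
weights = [
    [1.0, 0.0, 0.0, 0.0],      # output channel 0 copies input channel 0
    [0.25, 0.25, 0.25, 0.25],  # output channel 1 averages all input channels
]

reduced = conv1x1(feature_map, weights)
print(len(reduced), len(reduced[0]), len(reduced[0][0]))  # height, width, channels
```

The spatial dimensions stay at 2×2 while the channel count drops from 4 to 2, which is the sense in which these layers reduce the feature space before the next, more expensive convolution.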
The convolutional layers used in YOLO are pre-trained on the ImageNet classification task at half resolution (224×224 input images); the input resolution is then doubled to 448×448 for detection. YOLO uses a leaky ReLU activation for all layers except the final one, which uses a linear activation function.
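The two activation functions and the resolution change can be sketched in a few lines of plain Python. This is an illustrative sketch, not the Darknet implementation: the function and constant names are hypothetical, and the negative slope of 0.1 is the value given in the original YOLO paper rather than something stated in this section.

```python
def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU: pass positive inputs through, scale the rest by a small slope.

    The default slope of 0.1 is assumed from the original YOLO paper.
    """
    return x if x > 0 else negative_slope * x


def linear(x):
    """Linear (identity) activation, as used on the final layer."""
    return x


# Input resolutions (pixels per side): pre-training, then doubled for detection.
PRETRAIN_SIZE = 224
DETECT_SIZE = 2 * PRETRAIN_SIZE

for value in (-2.0, 0.0, 3.0):
    print(value, leaky_relu(value), linear(value))
print(PRETRAIN_SIZE, DETECT_SIZE)
```

Unlike a standard ReLU, the leaky variant keeps a small, non-zero output for negative inputs, so those units still receive a gradient during training.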
The following diagram shows the model architecture of YOLO:
Fig 11.2: YOLO architecture
The following is a link to the official YOLO website: https://pjreddie.com/darknet/yolo/.
In the next section, we will learn about the different versions of YOLO.