5. SSD model architecture
Figure 11.5.1 shows the SSD model architecture, which implements the conceptual framework of multi-scale single-shot object detection. The network accepts an RGB image and outputs several levels of prediction. A base, or backbone, network extracts features for the downstream tasks of classification and offset prediction. A good example of a backbone network is ResNet50, similar to the network discussed, implemented, and evaluated in Chapter 2, Deep Neural Networks. After the backbone network, the object detection task is performed by the rest of the network, which we call the SSD head.
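To make "several levels of prediction" concrete, here is a minimal sketch in plain Python. It is not the Keras implementation of the network; it only does the shape bookkeeping for a backbone followed by a multi-level head. Every name and number in it is an assumption made for illustration: a 512 x 512 input, a backbone that downsamples by 8, four head levels that each halve the feature map, four anchor boxes per feature map cell, four classes, and four offset values per box.

```python
from dataclasses import dataclass


@dataclass
class HeadLevel:
    """Output sizes of one prediction level of a hypothetical SSD head."""
    feature_size: int      # height and width of the square feature map
    n_boxes: int           # anchor boxes predicted at this level
    n_class_outputs: int   # class scores: one per class per box
    n_offset_outputs: int  # offsets: four values per box


def ssd_head_levels(input_size=512, backbone_stride=8, n_levels=4,
                    anchors_per_cell=4, n_classes=4):
    """List the output sizes of each head level.

    All defaults are illustrative assumptions, not values from the text.
    """
    levels = []
    # The backbone reduces the image to a coarser feature map.
    size = input_size // backbone_stride
    for _ in range(n_levels):
        n_boxes = size * size * anchors_per_cell
        levels.append(HeadLevel(feature_size=size,
                                n_boxes=n_boxes,
                                n_class_outputs=n_boxes * n_classes,
                                n_offset_outputs=n_boxes * 4))
        # Assumed here: each later level works at half the resolution.
        size = max(size // 2, 1)
    return levels


if __name__ == "__main__":
    levels = ssd_head_levels()
    for i, level in enumerate(levels, start=1):
        print(f"level {i}: {level.feature_size}x{level.feature_size} cells, "
              f"{level.n_boxes} boxes")
    print("total boxes:", sum(level.n_boxes for level in levels))
```

Under these assumptions, the finer feature maps contribute most of the boxes and the coarser maps contribute far fewer. Each level produces both a class prediction and an offset prediction for every one of its boxes.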
The backbone network can be a pre-trained network with frozen weights (for example, previously trained for ImageNet classification), or it can be trained jointly with the object detection task. If we use a pre-trained base network, we reuse feature extraction filters previously learned from a large dataset. In addition, it accelerates learning as the backbone network...