Like most object detection models, YOLO is built on a backbone model. The role of the backbone is to extract meaningful features from the image, which the final layers then use to make predictions. This is why the backbone is also called the feature extractor, a concept introduced in Chapter 4, Influential Classification Tools. The general YOLO architecture is depicted here:
While any architecture can be chosen as a feature extractor, the YOLO paper employs a custom architecture. The performance of the final model depends heavily on the choice of the feature extractor's architecture.
The final layer of the backbone outputs a feature volume of size w × h × D, where w × h is the size of the grid and D is the depth of the feature volume, that is, its number of channels. For instance, with VGG-16 as the backbone, D = 512.
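To make the shape of the feature volume concrete, here is a minimal sketch that computes w × h × D for a VGG-16 backbone with its classification head removed. The function name and the 224 × 224 and 416 × 416 input sizes are illustrative assumptions, not taken from the YOLO paper. The sketch relies only on the structure of VGG-16: its convolutions preserve spatial size, and its five 2 × 2 max-pooling layers each halve the width and height.

```python
def vgg16_feature_volume_shape(input_w, input_h):
    """Return (w, h, D) for the output of VGG-16's last pooling layer.

    Hypothetical helper for illustration only: it does shape arithmetic
    and does not build or run a network.
    """
    num_pooling_layers = 5  # VGG-16 has five 2x2 max-pooling layers, stride 2
    depth = 512             # channels in VGG-16's last convolutional block

    w, h = input_w, input_h
    for _ in range(num_pooling_layers):
        # Each pooling layer halves the spatial resolution (rounding down).
        w, h = w // 2, h // 2
    return w, h, depth


# Assumed input sizes, chosen for illustration.
shape_224 = vgg16_feature_volume_shape(224, 224)
shape_416 = vgg16_feature_volume_shape(416, 416)
print(shape_224)  # (7, 7, 512)
print(shape_416)  # (13, 13, 512)
```

The depth stays at 512 whatever the input size. Only the grid dimensions change.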
The size of the grid, w × h, depends on two factors:
- The stride...