The core idea of YOLO is this: reframing object detection as a single regression problem. What does this mean? Instead of using a sliding window or another complex technique, we will divide the input into a w × h grid, as represented in this diagram:
Figure 5.3: An example involving a plane taking off. Here, w = 5, h = 5, and B = 2, meaning, in total, 5 × 5 × 2 = 50 potential boxes, but only 2 are shown in the image
For each part of the grid, we will define B bounding boxes. Then, our only task will be to predict the following for each bounding box:
- The center of the box
- The width and height of the box
- The probability that this box contains an object
- The class of said object
Since all those predictions are numbers, we have therefore transformed the object detection problem into a regression problem.
It is important to make a distinction between the grid cells that divide the pictures into equal parts (w × h parts to...