Although we already addressed some object detection issues for static images by introducing convolutional sliding windows, our model may still not output very accurate bounding boxes, even with several bounding box sizes. Let's see how YOLO solves that problem:
We need to label our training data in a specific way for the YOLO algorithm to work correctly. The YOLO V2 format requires the bounding box center coordinates, bx and by, and the bounding box dimensions, bh and bw, to be expressed relative to the original image width and height.
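As a minimal sketch of what "relative to the original image" means, the snippet below converts a box given in pixel corner coordinates into relative values. The function name `to_yolo_box`, the corner-based input, and the 416 x 416 image size are assumptions made for illustration, not part of any particular labeling tool.

```python
def to_yolo_box(x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert pixel corner coordinates to relative (bx, by, bw, bh).

    Hypothetical helper: assumes bx, by denote the box center and that
    every value is divided by the image width or height, so results
    fall in the range [0, 1].
    """
    bx = (x_min + x_max) / 2.0 / img_w   # center x, relative to image width
    by = (y_min + y_max) / 2.0 / img_h   # center y, relative to image height
    bw = (x_max - x_min) / img_w         # box width, relative to image width
    bh = (y_max - y_min) / img_h         # box height, relative to image height
    return bx, by, bw, bh


# Example: a 208 x 104 pixel box inside an assumed 416 x 416 image.
print(to_yolo_box(104, 104, 312, 208, 416, 416))  # (0.5, 0.375, 0.5, 0.25)
```

Because the values are relative, the same label stays valid if the image is later resized.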
First, we go through each image and mark the objects we want to detect. After that, each image is split into a grid of smaller rectangles (boxes), usually 13 x 13, but here, for simplicity, we have 8 x 9. Both the bounding box (blue) and the...