2. Anchor boxes
From the discussion in the previous section, we learned that object detection must predict both the bounding box region and the category of the object inside it. For the time being, let us focus on estimating the bounding box coordinates.
How can a network predict the coordinates (xmin, ymin) and (xmax, ymax)? A network can make an initial guess such as (0, 0) and (w, h), corresponding to the upper left and lower right corner pixel coordinates of the image, where w is the image width and h is the image height. The network then iteratively corrects the estimates by performing regression on the ground truth bounding box coordinates.
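To make the initial guess concrete, the following is a minimal sketch in plain Python. The function names `initial_guess` and `coordinate_gaps`, the 640 x 480 image size, and the ground truth box are all hypothetical values chosen for illustration; they are not taken from the SSD implementation. The sketch only shows how far each corner of the whole-image guess is from a ground truth box, which is the gap that regression would have to close.

```python
def initial_guess(width, height):
    """Return the whole-image box as (xmin, ymin, xmax, ymax)."""
    return (0, 0, width, height)


def coordinate_gaps(guess, ground_truth):
    """Per-coordinate difference: ground truth minus guess, in pixels."""
    return tuple(gt - g for gt, g in zip(ground_truth, guess))


# Hypothetical 640 x 480 image and ground truth box (assumed values).
guess = initial_guess(640, 480)
ground_truth = (200, 120, 360, 300)

gaps = coordinate_gaps(guess, ground_truth)
print("initial guess:", guess)
print("gaps to ground truth:", gaps)
```

With a single whole-image guess, the corrections can be as large as the image itself, which hints at why one initial box per image is a weak starting point.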
Estimating bounding box coordinates directly in raw pixels is not optimal because the possible pixel values have high variance. Instead of regressing raw pixel coordinates, SSD minimizes the pixel error values between the ground truth bounding box and the predicted bounding box coordinates. For this example, the pixel error values are (xmin...