We mentioned that tx, ty, tw, and th are used to compute the bounding box coordinates. Why not ask the network to output the coordinates directly (x, y, w, and h)? In fact, that is how it was done in YOLO v1. Unfortunately, this resulted in many errors because objects vary widely in size.
Indeed, if most of the objects in the training dataset are big, the network will tend to predict very large values for w and h. When the trained model is then used on small objects, its predictions will often be wrong. To fix this problem, YOLO v2 introduced anchor boxes.
Anchor boxes (also called priors) are a set of bounding box sizes that are decided upon before training the network. For instance, when training a neural network to detect pedestrians, tall and narrow anchor boxes would be picked. An example is shown here:
A set of anchor...
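To make the role of the anchor boxes concrete, here is a minimal sketch of how tx, ty, tw, and th can be turned into a bounding box. It assumes the decoding formulas from the YOLO v2 paper, which may differ in detail from the ones used elsewhere in this chapter. The function name `decode_box`, the anchor sizes, and the grid size are hypothetical values chosen for illustration only.

```python
import math


def sigmoid(x):
    """Squash a raw network output into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, anchor, cell, grid_size):
    """Convert raw outputs (tx, ty, tw, th) into a box (x, y, w, h).

    Assumes the YOLO v2 formulation:
      - the box center is an offset inside the grid cell (cx, cy)
      - the box size is the anchor size (pw, ph) scaled by exp(tw), exp(th)
    All returned values are fractions of the image width/height.
    """
    tx, ty, tw, th = t
    pw, ph = anchor
    cx, cy = cell
    x = (sigmoid(tx) + cx) / grid_size
    y = (sigmoid(ty) + cy) / grid_size
    w = pw * math.exp(tw)  # tw = 0 means "exactly the anchor width"
    h = ph * math.exp(th)  # th = 0 means "exactly the anchor height"
    return x, y, w, h


# Hypothetical tall, narrow anchors (width, height) for pedestrian detection,
# expressed as fractions of the image size.
PEDESTRIAN_ANCHORS = [(0.05, 0.15), (0.10, 0.30), (0.20, 0.60)]

# With all outputs at zero, the predicted box is simply the anchor itself,
# centered in the middle of the grid cell.
box = decode_box((0.0, 0.0, 0.0, 0.0), PEDESTRIAN_ANCHORS[1], cell=(3, 4), grid_size=13)
print(box)
```

Because the network only has to predict how much to stretch or shrink a predefined anchor, rather than an absolute width and height, small and large objects can each be handled by an anchor of a suitable size.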