So far, we have used the saliency detector discussed previously to find bounding boxes of proto-objects. We could simply apply the algorithm to every frame of a video sequence and get a good idea of where the objects are. What gets lost, however, is correspondence information.
Imagine a video sequence of a busy scene, such as from a city center or a sports stadium. Although a saliency map could highlight all the proto-objects in every frame of a recorded video, the algorithm would have no way to establish a correspondence between proto-objects from the previous frame and proto-objects in the current frame.
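To make the missing correspondence concrete, here is a minimal sketch of what linking boxes across two frames could look like. It is not the approach developed in this chapter; it is only a naive baseline that pairs boxes by overlap. The helper names `iou` and `match_boxes`, the `(x, y, w, h)` box format, and the `min_iou` threshold are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping region (zero if disjoint)
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def match_boxes(prev_boxes, curr_boxes, min_iou=0.3):
    """Greedily pair each previous box with the best unused current box.

    Returns a list of (prev_index, curr_index) tuples. Boxes whose best
    overlap falls below min_iou (a hypothetical threshold) stay unmatched.
    """
    pairs = []
    used = set()
    for i, p in enumerate(prev_boxes):
        best_j, best_score = None, min_iou
        for j, c in enumerate(curr_boxes):
            if j in used:
                continue
            score = iou(p, c)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


# The detector returns boxes in no particular order, so the same
# two objects may appear at different list positions in each frame.
prev_boxes = [(10, 10, 20, 20), (100, 50, 30, 30)]
curr_boxes = [(102, 52, 30, 30), (12, 11, 20, 20)]
pairs = match_boxes(prev_boxes, curr_boxes)
```

In this toy input the two objects swap positions in the list between frames, and the overlap test recovers which box belongs to which. A purely per-frame detector provides no such link, and a greedy overlap test like this one can easily fail when objects move quickly or overlap each other, as they would in a busy scene.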
Also, the proto-object map might contain some false positives, so we need an approach for selecting the boxes that most probably correspond to real-world objects. Such false positives can be seen in the following example:
[Figure: example frame with proto-object bounding boxes, including false positives]
Note that the...