Some tasks relating video streams could naively be accomplished by studying each frame separately (memory less), but more efficient methods either take into account differences from image to image to guide the process to new frames or take complete image sequences as input for their predictions. Tracking, that is, localizing specific elements in a video stream, is a good example of such a task.
Tracking could be done frame by frame by applying detection and identification methods to each frame. However, it is much more efficient to use previous results to model the motion of the instances in order to partially predict their locations in future frames. Motion continuity is, therefore, a key predicate here, though it does not always hold (such as for fast-moving objects).