At 30 frames per second, processing every frame of a video implies analyzing 30 × 60 = 1,800 frames per minute. This problem was encountered very early in computer vision, well before the rise of deep learning, and techniques were devised to analyze videos efficiently.
The most obvious technique is sampling. We can analyze only one or two frames per second instead of all the frames. This is more efficient, but we may lose information if an important event appears only very briefly, as in the case of the gunshot mentioned earlier.
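The sampling idea can be illustrated with a minimal sketch that computes which frame indices to keep. The function name `sampled_frame_indices` and its parameters are hypothetical, chosen only for this illustration. It assumes a constant frame rate and does not decode any video.

```python
def sampled_frame_indices(total_frames, video_fps, samples_per_second):
    """Return the indices of the frames to keep when sampling a video.

    Hypothetical helper for illustration: assumes a constant frame rate.
    """
    if video_fps <= 0 or samples_per_second <= 0:
        raise ValueError("video_fps and samples_per_second must be positive")
    # Number of frames to skip between two analyzed frames (at least 1).
    step = max(1, round(video_fps / samples_per_second))
    return list(range(0, total_frames, step))


# One minute of video at 30 frames per second contains 1,800 frames.
one_per_second = sampled_frame_indices(1800, video_fps=30, samples_per_second=1)
two_per_second = sampled_frame_indices(1800, video_fps=30, samples_per_second=2)
print(len(one_per_second), len(two_per_second))  # 60 120
```

With one sample per second, only 60 of the 1,800 frames are analyzed, so any event lasting less than a second that falls between two sampled frames is missed entirely.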
A more advanced technique is scene extraction. This is particularly popular for analyzing movies. An algorithm detects when the video changes from one scene to another. For instance, if the camera goes from a close-up view to a wide view, we would analyze one frame from each shot. Even if the close-up is very short and the wide view spans many frames, we would still extract only one frame from each shot. Scene extraction...