A video without its audio track can be thought of as a sequence of images. The important features of those images can be extracted using a convolutional neural network pre-trained on an image classification dataset such as ImageNet. The activations of the last fully connected layer of the pre-trained network can serve as the features for each image sampled sequentially from the video. The rate at which images are sampled from the video depends on the type of content in the video and can be optimized through training.
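The sampling and per-frame feature extraction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `sample_frame_indices`, `extract_video_features`, and `toy_extractor` are hypothetical, frames are represented as plain lists of numbers rather than decoded images, and `toy_extractor` merely stands in for the last fully connected layer of a pre-trained network.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]   # stand-in for a decoded image (assumption)
Features = List[float]    # stand-in for the fully connected layer's activations


def sample_frame_indices(total_frames: int, video_fps: float,
                         samples_per_second: float) -> List[int]:
    """Return the indices of frames taken at a fixed rate, in temporal order."""
    if video_fps <= 0 or samples_per_second <= 0:
        raise ValueError("video_fps and samples_per_second must be positive")
    # Never step by less than one frame, even if the requested
    # sampling rate exceeds the frame rate of the video.
    step = max(video_fps / samples_per_second, 1.0)
    indices: List[int] = []
    k = 0
    while int(k * step) < total_frames:
        indices.append(int(k * step))
        k += 1
    return indices


def extract_video_features(frames: Sequence[Frame], indices: Sequence[int],
                           extractor: Callable[[Frame], Features]) -> List[Features]:
    """Apply the feature extractor to each sampled frame, preserving order."""
    return [extractor(frames[i]) for i in indices]


def toy_extractor(frame: Frame) -> Features:
    """Hypothetical placeholder for a pre-trained CNN: mean and max pixel value."""
    return [sum(frame) / len(frame), max(frame)]


if __name__ == "__main__":
    # A fake 3-second clip at 30 frames per second, two "pixels" per frame.
    frames = [[float(i), float(i + 1)] for i in range(90)]
    indices = sample_frame_indices(len(frames), video_fps=30.0, samples_per_second=2.0)
    features = extract_video_features(frames, indices, toy_extractor)
    print(indices)
    print(features)
```

In practice, `toy_extractor` would be replaced by a forward pass through the pre-trained network, and `samples_per_second` would be treated as a value to tune for the kind of video content at hand.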
The following diagram (Figure 5.1) illustrates a pre-trained neural network used for extracting features from a video:
As we can see from the preceding diagram, the sequentially sampled...