We will make use of the UCF101 dataset (https://www.crcv.ucf.edu/data/UCF101.php), which was put together by K. Soomro et al. (refer to UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, CRCV-TR-12-01, 2012). Here are a few examples from the dataset:
The dataset is composed of 13,320 video clips, each showing a person performing one of 101 possible actions.
To classify a video, we will use a two-step process. The recurrent network is not fed the raw pixel frames directly. While it could technically process full images, a CNN feature extractor is applied to each frame beforehand in order to reduce the dimensionality of the input and, consequently, the amount of computation performed by the LSTM. Therefore, our network architecture can be represented by Figure 8-5:
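This two-step architecture can be sketched in Keras as follows. The choice of InceptionV3 as the feature extractor, as well as the number of frames, the frame resolution, and the LSTM size, are illustrative assumptions, not the book's exact configuration; a real pipeline would typically use pretrained weights (`weights='imagenet'`), more frames per clip, and larger images:

```python
import tensorflow as tf

# Illustrative dimensions: UCF101 has 101 classes; the frame count and
# resolution here are kept small for demonstration purposes.
NUM_FRAMES = 4
FRAME_SIZE = 96
NUM_CLASSES = 101

# Step 1: a CNN used as a per-frame feature extractor.
# pooling='avg' collapses each frame into a single 2048-D feature vector.
# weights=None avoids downloading pretrained weights in this sketch;
# in practice you would use weights='imagenet'.
cnn = tf.keras.applications.InceptionV3(
    include_top=False, weights=None, pooling='avg',
    input_shape=(FRAME_SIZE, FRAME_SIZE, 3))
cnn.trainable = False  # features are extracted, not fine-tuned

# Step 2: an LSTM that reads the sequence of frame features and
# classifies the whole clip.
inputs = tf.keras.Input(shape=(NUM_FRAMES, FRAME_SIZE, FRAME_SIZE, 3))
features = tf.keras.layers.TimeDistributed(cnn)(inputs)   # (batch, frames, 2048)
state = tf.keras.layers.LSTM(256)(features)               # (batch, 256)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')(state)
model = tf.keras.Model(inputs, outputs)
```

`TimeDistributed` applies the same CNN to every frame in the clip, so the LSTM only ever sees compact feature vectors rather than raw pixels.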
As stated earlier, backpropagating...