In Chapter 7, Computer Vision, we showed how to detect and segment objects in single images. The objects in those images were static. However, if we add a temporal dimension to our input, objects can move within a scene. Understanding what happens across multiple frames (a video) is a much harder task. In this recipe, we demonstrate how to get started with video data. We will focus on combining a CNN and an RNN: the CNN extracts features from individual frames, and these per-frame features are combined into a sequence that serves as input to an RNN. This is also known as stacking, where we build (stack) a second model on top of another model; a sketch of this setup follows below.
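The following is a minimal sketch of this stacking approach in Keras, not the recipe's final code: a pretrained InceptionV3 backbone produces one feature vector per frame, and an LSTM classifies the resulting sequence. The choice of backbone, the frame count (SEQ_LEN), and the layer sizes are assumptions for illustration.

```python
import numpy as np
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input
from tensorflow.keras.layers import Input, Dense, Dropout, LSTM
from tensorflow.keras.models import Sequential

SEQ_LEN = 40        # frames sampled per video (assumption)
NUM_CLASSES = 101   # one output per action class

# Stage 1 - CNN feature extractor: InceptionV3 without its classification
# head, global-average-pooled to a 2048-dimensional vector per frame.
cnn = InceptionV3(weights='imagenet', include_top=False, pooling='avg')

def extract_features(frames):
    """frames: array of shape (SEQ_LEN, 299, 299, 3) -> (SEQ_LEN, 2048)."""
    return cnn.predict(preprocess_input(frames.astype('float32')), verbose=0)

# Stage 2 - RNN stacked on top of the CNN: an LSTM over the frame features.
rnn = Sequential([
    Input(shape=(SEQ_LEN, 2048)),
    LSTM(256, dropout=0.5),
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(NUM_CLASSES, activation='softmax'),
])
rnn.compile(optimizer='adam',
            loss='categorical_crossentropy',
            metrics=['accuracy'])
rnn.summary()
```

In practice, the per-frame features would be extracted once and cached to disk, so that only the (much smaller) RNN needs to be trained on the sequences.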
For this recipe, we will be using a dataset that contains 13,320 short videos. These videos are distributed over a total of 101 different classes. Because of the complexity of this task, we don...