In this section, we will start combining convolutional, max pooling, dense, and recurrent layers to classify each frame of a video clip. Each video contains several human activities; an activity persists across multiple frames, though the subject may move within the frame or leave it entirely. First, let's look at a more detailed description of the dataset we will be using for this project.
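A minimal sketch of such a layer combination, assuming TensorFlow/Keras: the clip dimensions (16 frames of 64x64 RGB) are illustrative placeholders, not values prescribed by the text. `TimeDistributed` applies the same convolutional feature extractor to every frame, and the LSTM with `return_sequences=True` emits one prediction per frame rather than one per clip.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

FRAMES, HEIGHT, WIDTH, CHANNELS = 16, 64, 64, 3  # placeholder clip dimensions
NUM_CLASSES = 101                                # UCF101 has 101 action categories

model = models.Sequential([
    layers.Input(shape=(FRAMES, HEIGHT, WIDTH, CHANNELS)),
    # Apply the same CNN feature extractor to every frame independently
    layers.TimeDistributed(layers.Conv2D(32, (3, 3), activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
    layers.TimeDistributed(layers.Conv2D(64, (3, 3), activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
    layers.TimeDistributed(layers.Flatten()),
    # The LSTM models how per-frame features evolve over time;
    # return_sequences=True keeps one output per frame
    layers.LSTM(128, return_sequences=True),
    # One class distribution per frame
    layers.TimeDistributed(layers.Dense(NUM_CLASSES, activation="softmax")),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
print(model.output_shape)  # (None, 16, 101)
```

The key design choice is wrapping the convolutional stack in `TimeDistributed` so that spatial features are extracted frame by frame before the recurrent layer reasons about their temporal order.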
Video classification using convolutional LSTM
UCF101 – action recognition dataset
UCF101 is an action recognition dataset of realistic action videos collected from YouTube. It comprises 13,320 videos spanning 101 action categories. The videos exhibit large variations in camera motion, object appearance and pose, object scale...
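UCF101 clips follow a consistent naming convention, such as `v_ApplyEyeMakeup_g08_c01.avi`, where the middle token is the action class and the `g`/`c` suffixes identify the group and clip. A small sketch of deriving labels from filenames, assuming that convention (the sample filenames are illustrative):

```python
from pathlib import Path

def label_from_filename(filename: str) -> str:
    """Extract the action class from a UCF101-style filename,
    e.g. 'v_ApplyEyeMakeup_g08_c01.avi' -> 'ApplyEyeMakeup'."""
    return Path(filename).stem.split("_")[1]

# Illustrative sample filenames in the UCF101 naming style
clips = ["v_ApplyEyeMakeup_g08_c01.avi", "v_Basketball_g01_c02.avi"]
labels = sorted({label_from_filename(c) for c in clips})
print(labels)  # ['ApplyEyeMakeup', 'Basketball']
```

Deriving the label from the filename like this is a common way to build the class list for all 101 categories without a separate annotation file.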