To generate feature vectors, we will use an Inception network pretrained for image classification on the ImageNet dataset.
We will remove the last layer (the fully connected layer) and keep only the feature vector produced by the max-pooling operation.
Another option would be to keep the output of the layer just before pooling, that is, the higher-dimensional feature maps. In our example, however, we do not need spatial information: whether the action takes place in the middle of the frame or in a corner, the prediction should be the same. Therefore, we will use the output of the two-dimensional max-pooling layer. This also makes training faster, since the input of the LSTM is 64 times smaller (64 = 8 × 8, the spatial size of each feature map for an input image of size 299 × 299).
TensorFlow allows us to access a pretrained model with a single line, as described in Chapter 4, Influential Classification...
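As a sketch of this setup, the feature extractor can be built with the Keras applications API. Note that `InceptionV3` is an assumption here, standing in for whichever Inception variant the chapter uses:

```python
import numpy as np
import tensorflow as tf

# include_top=False drops the final fully connected (classification) layer,
# and pooling='max' appends a global 2D max-pooling layer that collapses the
# 8 x 8 x 2048 feature maps into a single 2,048-dimensional vector per frame.
feature_extractor = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False,
    pooling="max", input_shape=(299, 299, 3))

# One dummy frame, preprocessed the way Inception expects (values in [-1, 1]):
frame = np.random.randint(0, 256, size=(1, 299, 299, 3)).astype(np.float32)
frame = tf.keras.applications.inception_v3.preprocess_input(frame)

features = feature_extractor.predict(frame)
print(features.shape)  # (1, 2048)
```

Passing `pooling=None` instead would return the 8 × 8 × 2048 feature maps, the higher-dimensional option mentioned above, if spatial information were needed.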