In this recipe, we will learn how to answer questions about the content of a specific image. This is a powerful form of visual question answering (Visual Q&A) that combines visual features extracted from a pre-trained VGG16 model with word embeddings. These two sets of heterogeneous features are then merged into a single network whose final layers are an alternating sequence of Dense and Dropout layers. This recipe works on Keras 2.0+.
Therefore, this recipe will teach you how to:
- Extract features from a pre-trained VGG16 network.
- Use pre-built word embeddings to map words into a space where similar words are adjacent.
- Use LSTM layers to build a language model. LSTMs will be discussed in Chapter 6; for now, we will use them as black boxes.
- Combine different heterogeneous input features to create a combined model.
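The overall architecture can be sketched with the Keras functional API as follows. This is a minimal sketch, not the recipe's exact model: the vocabulary size, sequence length, number of answers, and layer widths are illustrative assumptions, and `weights=None` is used here (in practice you would pass `weights='imagenet'` and load pre-trained embedding vectors into the `Embedding` layer):

```python
# Sketch of a Visual Q&A model: VGG16 image features + LSTM question
# features, merged and followed by alternating Dense/Dropout layers.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import (Input, Embedding, LSTM, Dense,
                                     Dropout, concatenate)
from tensorflow.keras.models import Model

num_words = 10000        # vocabulary size (assumed)
max_question_len = 30    # max tokens per question (assumed)
num_answers = 1000       # size of the answer vocabulary (assumed)

# Visual branch: features from VGG16 with the classifier top removed;
# global average pooling yields a single 512-d vector per image.
# weights=None here to avoid a download; use weights='imagenet' in practice.
vgg = VGG16(weights=None, include_top=False, pooling='avg')
image_input = Input(shape=(224, 224, 3))
image_features = vgg(image_input)

# Language branch: word embeddings (pre-trained vectors would be loaded
# as initial weights in practice), followed by an LSTM used as a black box.
question_input = Input(shape=(max_question_len,))
embedded = Embedding(num_words, 256)(question_input)
question_features = LSTM(256)(embedded)

# Merge the heterogeneous features, then stack alternating Dense/Dropout.
merged = concatenate([image_features, question_features])
x = Dense(1000, activation='relu')(merged)
x = Dropout(0.5)(x)
x = Dense(1000, activation='relu')(x)
x = Dropout(0.5)(x)
output = Dense(num_answers, activation='softmax')(x)

model = Model(inputs=[image_input, question_input], outputs=output)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
```

The model takes two inputs (an image tensor and a tokenized question) and outputs a softmax distribution over candidate answers.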