Answering questions about images (visual Q&A)
One of the nice things about neural networks is that different media types can be combined to provide a unified interpretation. For instance, Visual Question Answering (VQA) combines image recognition with natural language processing of text. Training can use the VQA dataset (available at https://visualqa.org/), which contains open-ended questions about images; answering these questions requires an understanding of vision, language, and common knowledge. The following images are taken from a demo available at https://visualqa.org/.
Note the question at the top of the image, and the subsequent answers:
Figure 20.10: Examples of visual questions and answers
If you want to start playing with VQA, the first thing to do is to get an appropriate training dataset, such as the VQA dataset, the CLEVR dataset (available at https://cs.stanford.edu/people/jcjohns/clevr/), or the FigureQA dataset (available at https://datasets...
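Once you have a dataset, the combination of vision and language described above can be realized with a two-branch network: a CNN encodes the image, an LSTM encodes the question, and the two feature vectors are fused before a classifier predicts the answer. The following is a minimal Keras sketch of this idea; the vocabulary size, question length, number of candidate answers, and layer widths are illustrative assumptions, not values prescribed by any of the datasets above:

```python
# Minimal VQA model sketch: CNN image branch + LSTM question branch.
# VOCAB_SIZE, MAX_QUESTION_LEN, and NUM_ANSWERS are assumed values.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10000      # assumed question vocabulary size
MAX_QUESTION_LEN = 25   # assumed maximum question length in tokens
NUM_ANSWERS = 1000      # assumed number of candidate answers (top-k)

# Image branch: a frozen pretrained VGG16 produces a feature vector.
cnn = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                  pooling="avg", input_shape=(224, 224, 3))
cnn.trainable = False

image_input = layers.Input(shape=(224, 224, 3), name="image")
image_features = layers.Dense(512, activation="relu")(cnn(image_input))

# Question branch: embed token IDs and summarize them with an LSTM.
question_input = layers.Input(shape=(MAX_QUESTION_LEN,), name="question")
embedded = layers.Embedding(VOCAB_SIZE, 256)(question_input)
question_features = layers.LSTM(512)(embedded)

# Fuse the two modalities (element-wise product is one common choice;
# concatenation also works) and classify over the answer vocabulary.
fused = layers.multiply([image_features, question_features])
hidden = layers.Dense(512, activation="relu")(fused)
output = layers.Dense(NUM_ANSWERS, activation="softmax", name="answer")(hidden)

model = Model(inputs=[image_input, question_input], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Treating VQA as classification over a fixed set of frequent answers, as this sketch does, is a common simplification; richer models replace the element-wise fusion with attention over image regions.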