Visual question answering
Imagine a scenario where you are given an image and asked to answer questions about it. This is the task of visual question answering (VQA). A high-level strategy for VQA is to leverage a pre-trained image encoder to encode the image, encode the question (text) using a large language model (LLM), and then use the combined image and text representations to generate (decode) the answer. This is essentially a multimodal model, one that takes input in both the image and text modalities.
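As a concrete illustration of this strategy, here is a minimal sketch using the Hugging Face transformers library with a publicly available BLIP VQA checkpoint. The specific model and image path are assumptions for illustration, not necessarily the setup used elsewhere in this chapter:

```python
# Minimal VQA sketch: a pre-trained vision encoder plus a text
# encoder/decoder answer a question directly from the image.
# The Salesforce/blip-vqa-base checkpoint and the image path are
# illustrative assumptions.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("sample.jpg").convert("RGB")  # hypothetical image file
question = "What is the person in the image doing?"

# Encode the image and the question into a single multimodal input
inputs = processor(image, question, return_tensors="pt")

# Generate (decode) the answer tokens and convert them back to text
output_ids = model.generate(**inputs)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```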
One way of performing visual question answering is to first extract a caption for the image and then perform text-based question answering on that caption, as sketched below.
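A hedged sketch of this caption-then-QA pipeline, again using Hugging Face transformers pipelines; the model names and the image path are illustrative choices, not prescribed by the text:

```python
# Caption-then-QA sketch: first caption the image, then run extractive
# question answering over the caption text alone.
# Model names and the image path are illustrative assumptions.
from transformers import pipeline

captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

caption = captioner("sample.jpg")[0]["generated_text"]
answer = qa(question="What is the person in the image doing?",
            context=caption)
print(caption, "->", answer["answer"])
```

Note that the question-answering step only ever sees the caption text, so any detail of the image that the caption omits cannot be recovered by this pipeline.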
To understand why this approach falls short, let's look at the following image:
Figure 15.19: Sample image
The caption extracted from this image, along with a set of possible questions about the image, is shown below:
Extracted caption | Question