LLaVA – Large Language and Vision Assistant
As its name suggests, LLaVA [3] is closely related to LLaMA, not only in name but also in its internals: LLaVA uses LLaMA as its language component, which makes it possible to swap out the language model if needed. This is a valuable feature for many scenarios. One of the key strengths of Stable Diffusion is its openness to model swapping and fine-tuning, and, similar to Stable Diffusion, LLaVA is designed to leverage open-source LLMs.
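To make this concrete, here is a minimal sketch of querying a LLaVA model through the Hugging Face transformers library. It assumes the community-maintained llava-hf/llava-1.5-7b-hf checkpoint (not part of the original paper's release), a local image file named example.jpg, and a GPU with enough memory for a 7B model:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed checkpoint: the community "llava-hf" port of LLaVA-1.5 (7B)
model_id = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce GPU memory
    device_map="auto",
)

image = Image.open("example.jpg")  # any local image (hypothetical file)
# LLaVA-1.5 chat format: the <image> token marks where visual tokens go
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

# The processor tokenizes the text and preprocesses the image together
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Because the language tower is a standard LLM, trying a different variant is, in principle, a matter of pointing model_id at another compatible repository.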
Next, let’s take a look at how LLaVA works.
How LLaVA works
The LLaVA authors, Haotian Liu et al. [3], present a clear and accurate diagram showing how the model leverages pretrained CLIP and LLaMA models in its architecture, as shown in the following figure:
Figure 15.2: Architecture of LLaVA
Let’s read the diagram from the bottom up. During inference, we provide an image, denoted as X_v, and a language instruction, denoted...