Generating Image Descriptions Using BLIP-2 and LLaVA
Imagine you have an image in hand and need to upscale it or generate new images based on it, but you don’t have the prompt or description associated with it. You may say, “Fine, I can write a new prompt for it.” For one image, that is acceptable, but what if there are thousands or even millions of images without descriptions? Writing them all manually is impractical.
Fortunately, we can use artificial intelligence (AI) to help us generate descriptions. Many pretrained models can do this, and their number keeps growing. In this chapter, I am going to introduce two AI solutions that generate a caption, description, or prompt for an image, fully automatically:
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [1]
- LLaVA: Large Language and Vision Assistant [3]
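To make the "thousands of images" scenario concrete, here is a minimal sketch of a batch-captioning loop. The helper names (`caption_folder`, `make_blip2_captioner`), the list of image extensions, and the `Salesforce/blip2-opt-2.7b` checkpoint are assumptions chosen for illustration, not something prescribed by this chapter; any function that maps an image path to a string could be plugged in as the captioner.

```python
from pathlib import Path

# Assumed set of extensions treated as images; adjust to your data.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def caption_folder(folder, captioner):
    """Return {filename: caption} for every image file directly inside folder.

    `captioner` is any callable that takes a Path and returns a string.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in IMAGE_EXTENSIONS:
            results[path.name] = captioner(path)
    return results


def make_blip2_captioner(model_id="Salesforce/blip2-opt-2.7b"):
    """Build a captioner backed by BLIP-2 (hypothetical helper).

    Heavy dependencies are imported lazily so that caption_folder can be
    used, or tested with a stub captioner, without loading a model.
    """
    import torch
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    def captioner(path):
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            token_ids = model.generate(**inputs, max_new_tokens=30)
        text = processor.batch_decode(token_ids, skip_special_tokens=True)[0]
        return text.strip()

    return captioner
```

With the required packages installed and the model weights downloaded, a call such as `caption_folder("my_images", make_blip2_captioner())` would return one caption per image file; `my_images` is a placeholder folder name. Keeping the loop separate from the model means the same loop can drive a different captioning model later.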
BLIP-2 [1] is fast and requires relatively low...