BLIP-2 – Bootstrapping Language-Image Pre-training
In the paper BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [4], Junnan Li et al. proposed a method to bridge the gap between the natural language and vision modalities. Notably, the BLIP model demonstrated exceptional image-captioning capabilities, outperforming the state of the art at the time of its publication.
The reason behind this excellent quality is an innovative technique, called CapFilt in the paper, in which Junnan Li et al. used their initially pretrained model to build two additional models:
- Filter model
- Captioner model
The filter model removes low-quality image-text pairs, thereby improving the quality of the training data, while the captioner model generates surprisingly good, short descriptions for an image. With the help of these two models, the authors not only improved the quality of the training data but also enlarged it: the captioner produces synthetic captions for web images, and the filter discards the noisy pairs, as sketched in the example below.
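To make the idea concrete, here is a minimal sketch of one CapFilt-style bootstrapping step using the publicly released BLIP checkpoints on the Hugging Face Hub: the captioner proposes a synthetic caption for a web image, and the filter's image-text matching (ITM) head decides whether to keep the resulting pair. The model names, the example image URL, and the 0.5 acceptance threshold are illustrative assumptions, not the exact setup used in the paper.

```python
import requests
import torch
from PIL import Image
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    BlipForImageTextRetrieval,
)

# Captioner: generates a synthetic caption for a web image.
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Filter: scores how well an (image, text) pair matches via the ITM head.
itm_processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
itm_filter = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")

image_url = "https://example.com/cat.jpg"  # hypothetical image URL
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# 1) Captioning: propose a short synthetic caption for the image.
cap_inputs = cap_processor(images=image, return_tensors="pt")
generated = captioner.generate(**cap_inputs, max_new_tokens=30)
caption = cap_processor.decode(generated[0], skip_special_tokens=True)

# 2) Filtering: keep the pair only if the match probability is high enough.
itm_inputs = itm_processor(images=image, text=caption, return_tensors="pt")
with torch.no_grad():
    itm_logits = itm_filter(**itm_inputs).itm_score  # shape (1, 2): [no match, match]
match_prob = torch.softmax(itm_logits, dim=1)[0, 1].item()

if match_prob > 0.5:  # illustrative threshold, not the paper's setting
    print(f"keep: '{caption}' (p_match={match_prob:.2f})")
else:
    print(f"discard: '{caption}' (p_match={match_prob:.2f})")
```

In the paper, this captioning-and-filtering loop is run over large collections of web images, and the surviving pairs are added back into the pre-training corpus, which is what both cleans and enlarges the dataset.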