CLIP
Contrastive Language-Image Pre-Training (CLIP) is a multimodal transformer that can be used for image classification. CLIP’s process can be summed up as follows:
- A feature extractor, like ViT, produces image tokens.
- Text is also input as tokens, just as images are turned into patch tokens in ViT.
- The attention layers of each encoder learn the relationships within their own token sequence; the two modalities are then related by comparing their embeddings in a shared space.
- The output is raw logits (image-text similarity scores), as in ViT (see the sketch after this list).
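The following is a minimal sketch of that flow, assuming the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint (both are illustrative choices, not the code we will run later in this chapter); the placeholder image and captions are also assumptions.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint and its matching processor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image; replace it with a real photo to get meaningful scores.
image = Image.new("RGB", (224, 224), color="white")
texts = ["a photo of a cat", "a photo of a dog"]

# The processor turns the image into pixel inputs for the ViT-style encoder
# and the captions into text tokens.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

# The model outputs raw logits: one image-text similarity score per caption.
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image   # shape: (1, number of captions)
probs = logits_per_image.softmax(dim=-1)      # classification over the captions
print(probs)
```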
We will first look into the basic architecture of CLIP before running CLIP in code.
The basic architecture of CLIP
The model is contrastive: it learns how images and captions fit together through their similarities and differences. During joint (text, image) pretraining, matching pairs are pulled toward each other and mismatched pairs are pushed apart. After pretraining, CLIP can be transferred to new tasks.
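To make "pulled toward each other" concrete, the pretraining objective can be sketched as a symmetric contrastive loss over a batch of matching (image, text) embedding pairs. The function name, tensor shapes, and temperature value below are illustrative assumptions, not CLIP's actual training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch where pair i is a matching (image, text)."""
    # Project both modalities onto the unit sphere so similarity is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs lie on the diagonal: image i belongs with caption i.
    targets = torch.arange(logits.size(0))

    # Pull matching pairs together and push mismatched ones apart, in both directions.
    loss_images = F.cross_entropy(logits, targets)        # images -> captions
    loss_texts = F.cross_entropy(logits.t(), targets)     # captions -> images
    return (loss_images + loss_texts) / 2

# Toy usage with random embeddings for a batch of 4 pairs.
loss = clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
print(loss)
```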
CLIP is transferable because it can learn new visual concepts, much as GPT models learn new language concepts, which makes it applicable to tasks such as action recognition in video, among others.