Introducing CLIP
Imagine a scenario where you have access to an image dataset, but no labels for the individual images. All you have is an exhaustive list of the unique labels that occur somewhere in the dataset. How would you assign the most probable label to a given image?
CLIP (Contrastive Language-Image Pre-training) comes to the rescue in such a scenario. CLIP provides an embedding for each image and for each label (the text associated with the image, typically its class name or its caption) in a shared embedding space. This lets us associate an image with its most probable label: we compute the similarity between the image embedding and the text embeddings of all the unique labels, and the label whose embedding is most similar to the image embedding is the most probable one.
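To make this concrete, the following is a minimal sketch of this zero-shot labeling scheme using a pre-trained CLIP model (we build one from scratch later). It assumes the Hugging Face transformers library, the openai/clip-vit-base-patch32 checkpoint, an illustrative label list, and a placeholder image path, example.jpg:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP model and its matching processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# The exhaustive list of unique labels known to occur in the dataset
# (illustrative; substitute the labels of your own dataset)
labels = ["cat", "dog", "airplane"]

# Placeholder path; replace with an image from your dataset
image = Image.open("example.jpg")

# Tokenize the label texts and preprocess the image
inputs = processor(text=[f"a photo of a {label}" for label in labels],
                   images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Embed the image and every candidate label in the shared space
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Normalize so that the dot product equals cosine similarity
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# One similarity score per label; the highest is the most probable label
similarities = (image_emb @ text_emb.T).squeeze(0)
print(labels[similarities.argmax().item()])
```

Note that each label is wrapped in a prompt such as "a photo of a cat" rather than passed as the bare word; this tends to match the caption-like text CLIP was trained on more closely.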
In the next section, we will develop an understanding of how CLIP works and build a CLIP model from scratch before using a pre-trained one.
How CLIP works
To understand how CLIP...