Working with multimodal transformers
Multimodal transformers such as CLIP are useful whenever more than one modality is involved. To illustrate a use case for this kind of model, a simple and straightforward example follows.
Zero-shot image classification is useful when only class names, or phrases related to the classes, are available. CLIP, a multimodal model, represents images and texts in the same semantic space, which makes it well suited to this situation. Imagine a case where class names are given with no examples for any of them: the only available knowledge is these names or phrases, together with a set of previously unseen images, and the classifier must work with no prior knowledge of which classes might appear.
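As a minimal sketch of this shared space (the checkpoint name, image file, and prompt phrases below are illustrative assumptions, not taken from the original), both modalities can be embedded with the Hugging Face transformers implementation of CLIP and compared by cosine similarity:

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.png")  # placeholder local image
texts = ["a photo of a cat", "a photo of a dog"]  # placeholder phrases

text_inputs = processor(text=texts, padding=True, return_tensors="pt")
image_inputs = processor(images=image, return_tensors="pt")

# project both modalities into the same semantic space
text_emb = model.get_text_features(**text_inputs)
image_emb = model.get_image_features(**image_inputs)

# L2-normalize so the dot product becomes cosine similarity
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T  # one score per (image, phrase) pair

The phrase closest to the image gives the predicted class, with no training examples involved.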
- The first thing needed for this experiment is a few images, either local ones or downloaded from the internet:
from PIL import Image
import requests

url = "http:...
# fetch the image over HTTP and open it with PIL
image = Image.open(requests.get(url, stream=True).raw)
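Once an image is available, the rest of the classifier can be sketched with the transformers zero-shot-image-classification pipeline (the checkpoint and candidate labels here are illustrative assumptions):

from transformers import pipeline

classifier = pipeline("zero-shot-image-classification",
                      model="openai/clip-vit-base-patch32")
# candidate_labels are the only prior knowledge the classifier gets
predictions = classifier(image, candidate_labels=["cat", "dog", "car"])
print(predictions)  # list of {label, score} dicts, best match first

This keeps the whole workflow at the level of class names: no fine-tuning and no labeled examples are required.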