Foundation Models in Computer Vision
In the previous chapter, we learned how to build novel applications using NLP and CV techniques. However, this requires a significant amount of training, either from scratch or by fine-tuning a pre-trained model. A pre-trained model has generally been trained on a large corpus of data, for example, a dataset like ImageNet, whose full version contains ~14 million images spanning roughly 21,000 categories. On the internet, however, we have access to hundreds of millions of images along with the alt text corresponding to them. What if we pre-train models on internet-scale data and then use those models out of the box, without any fine-tuning, for different applications involving object detection, segmentation, and text-to-image generation? This idea forms the bedrock of foundation models.
In this chapter, we will learn about:
- Leveraging image and text embeddings to identify the most relevant image for a given text and vice versa
- Leveraging...