Building an Image Search Engine Using CLIP: A Multimodal Approach
In the previous chapter, we focused on Transformer models such as BERT and GPT, leveraging their capabilities for sequence learning tasks. In this chapter, we'll explore CLIP (Contrastive Language-Image Pre-training), a multimodal model that seamlessly connects visual and textual data. With its dual-encoder architecture, CLIP learns the relationships between visual and textual concepts, enabling it to excel in tasks involving both images and text. We will delve into its architecture, key components, and learning mechanism, leading to a practical implementation of the model. We will then build a multimodal image search engine with text-to-image and image-to-image capabilities. To top it all off, we will tackle an awesome zero-shot image classification project!
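To give you a quick taste of the kind of image-text matching CLIP performs, here is a minimal sketch that scores a single image against a few candidate captions using a pretrained checkpoint from the Hugging Face transformers library. The checkpoint name and the image path are placeholders, and this shortcut is not the implementation we develop in this chapter; treat it purely as a preview of the behavior we will build up to.

```python
# Preview only: scoring an image against candidate captions with a
# pretrained CLIP checkpoint (assumes transformers, torch, and pillow
# are installed; "photo.jpg" is a placeholder path).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
captions = ["a photo of a cat", "a photo of a dog", "a city skyline at night"]

# The processor tokenizes the captions and preprocesses the image;
# the model embeds both and returns image-text similarity logits.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability means the caption is a better match for the image.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```

The same similarity scores that rank captions here are what will later power text-to-image search, image-to-image search, and zero-shot classification.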
We will cover the following topics in this chapter:
- Introducing the CLIP model
- Getting started with the dataset
- Architecting the CLIP model
- Finding images with words...