Summary
In a short time, natural language transformers have evolved into Foundation Models. Generative AI has reached new levels with models such as ViT, CLIP, DALL-E, and GPT-4V.
We first explored the architecture of ViT, which splits images into patches that are processed like words (tokens). We discovered that there is more than one way to implement models in real-world ML. Understanding these different approaches helps build a personal toolbox for solving problems when implementing ML projects.
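The patch-splitting step described above can be sketched in a few lines. This is an illustrative NumPy example, not the book's notebook code: it shows how an image becomes a sequence of flattened patches, the "visual words" a ViT then embeds and feeds to a transformer.

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened (N, patch_size*patch_size*C) patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Cut the image into a grid of patch_size x patch_size tiles
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each tile into one vector, yielding a token sequence
    return patches.reshape(-1, patch_size * patch_size * c)

image = np.zeros((224, 224, 3))       # a common ViT input resolution
tokens = image_to_patches(image, 16)  # 16x16 patches, as in ViT-Base
print(tokens.shape)                   # (196, 768): 196 tokens of dimension 768
```

Each of the 196 patch vectors plays the role of a word embedding input in the transformer encoder.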
Then, we explored CLIP, which can associate words and images. Next, we looked into the architecture of DALL-E, going down to the tensor level to look under the hood of these innovative models. We then implemented the DALL-E 2 and DALL-E 3 APIs.
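CLIP's association of words and images ultimately reduces to comparing embeddings. The following is a simplified sketch, assuming a text encoder and an image encoder have already produced fixed-size vectors: the image is scored against candidate captions with temperature-scaled cosine similarity, as in CLIP's contrastive objective.

```python
import numpy as np

def clip_scores(image_emb: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Return softmax probabilities of one image embedding against N caption embeddings."""
    # L2-normalize so the dot product becomes cosine similarity
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = 100.0 * txt @ img  # temperature scaling (logit_scale ~ 100 in CLIP)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
probs = clip_scores(rng.normal(size=512), rng.normal(size=(3, 512)))
print(probs)  # probabilities over the 3 candidate captions, summing to 1
```

The caption with the highest probability is CLIP's best match for the image; the same similarity also powers zero-shot classification.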
Finally, we built a GPT-4V notebook with DALL-E 3 images, implementing an example of divergent semantic association.
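Generating the DALL-E 3 images for such a notebook goes through the OpenAI images endpoint. Below is a minimal sketch, assuming the `openai` Python package is installed and an `OPENAI_API_KEY` is set in the environment; the prompt is an invented placeholder and the network call is left commented out.

```python
# Request parameters for a DALL-E 3 image (prompt is a hypothetical example)
params = {
    "model": "dall-e-3",
    "prompt": "a watercolor painting of a city at dawn",
    "size": "1024x1024",
    "n": 1,
}

# from openai import OpenAI
# client = OpenAI()                           # reads OPENAI_API_KEY
# response = client.images.generate(**params)
# print(response.data[0].url)                 # URL of the generated image
```

The returned image URL (or base64 payload) can then be passed to GPT-4V for the divergent semantic association experiment.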
The paradigm shift resides in the tremendous resources that only a few organizations have to train Generative AI models on petabytes...