References
For more information on topics covered in this chapter, refer to the following resources:
- CoCa: Contrastive Captioners are Image-Text Foundation Models: https://arxiv.org/abs/2205.01917
- CLIP: Connecting text and images: https://openai.com/blog/clip/
- Masked Vision and Language Modeling for Multi-modal Representation Learning: https://arxiv.org/pdf/2208.02131.pdf
- Language Models are Few-Shot Learners: https://arxiv.org/abs/2005.14165
- Hierarchical Text-Conditional Image Generation with CLIP Latents: https://cdn.openai.com/papers/dall-e-2.pdf
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding: https://arxiv.org/abs/2205.11487
- Flamingo: a Visual Language Model for Few-Shot Learning: https://arxiv.org/pdf/2204.14198.pdf
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: https://arxiv.org/pdf/1910.10683.pdf
- An Image is Worth 16x16 Words: Transformers for Image Recognition...