Multimodal Generative Transformers
Models that can understand more than one type of input, such as text, images, or audio, are called multimodal models. Multimodal learning has long been an important field of artificial intelligence (AI) and has attracted sustained research interest. Today, multimodal models are popular in many forms. In this chapter, we will learn about generative AI using multimodal models, especially text-to-image and text-to-music models. You will learn how Stable Diffusion works, and you will also get to know the MusicGen and AudioGen models.
The following topics will be covered in this chapter:
- Multimodal learning
- Stable Diffusion for text-to-image generation
- Music generation using MusicGen
- Text-to-speech generation using transformers