What can AI do in other domains?
Generative AI models have demonstrated impressive capabilities across modalities such as sound, music, video, and 3D shapes. In the audio domain, models can synthesize natural speech, generate original music compositions, and even mimic a speaker's voice along with their patterns of rhythm and intonation (prosody).
Speech-to-text systems, known as Automatic Speech Recognition (ASR), convert spoken language into text. For video, AI systems can create photorealistic footage from text prompts and perform sophisticated editing such as object removal. 3D models can reconstruct scenes from images and generate intricate objects from textual descriptions.
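To make ASR concrete, here is a minimal sketch using the Hugging Face transformers library; the Whisper checkpoint and the audio file path are assumptions for illustration, not a recommendation of a specific model:

```python
# A minimal speech-to-text (ASR) sketch with Hugging Face transformers.
from transformers import pipeline

# Load a pretrained speech recognition model (checkpoint name is an assumption).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Transcribe a local audio file into text ("speech_sample.wav" is hypothetical).
result = asr("speech_sample.wav")
print(result["text"])
```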
There are many types of generative models, handling different data modalities across various domains, as shown in the following table:
| Model Type | Input | Output | Examples |
|------------|-------|--------|----------|
| Text-to-Text | Text | Text | Mixtral, GPT-4, Claude 3, Gemini |
| Text-to-Image | Text | Images | DALL-E 2, Stable Diffusion, Imagen |
| Text-to-Audio | Text | Audio | Jukebox, AudioLM, MusicGen |
| Text-to-Video | Text | Video | Sora |
| Image-to-Text | Images | Text | CLIP, BLIP |
| Image-to-Image | Images | Images | Super-resolution, style transfer, inpainting |
| Text-to-Code | Text | Code | AlphaCode, Codex |
| Video-to-Audio | Video | Audio | Soundify |
| Text-to-Math | Text | Mathematical expressions | ChatGPT, Claude |
| Text-to-Scientific | Text | Scientific output | Minerva, Galactica |
| Algorithm Discovery | Text/Data | Algorithms | AlphaTensor |
| Multimodal Input | Text, Images | Text | GPT-4V |
Table 1.1: Models for audio, video, and other domains
There are many more combinations of modalities to consider; these are just some that I have come across. We could also consider subcategories of text, such as text-to-math, which generates mathematical expressions from text and where models such as ChatGPT and Claude shine, or text-to-code, where models such as AlphaCode and Codex generate programming code from text. A few models specialize in scientific text, such as Minerva and Galactica, or in algorithm discovery, such as AlphaTensor.
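As a hedged illustration of text-to-code, the following sketch generates code from a natural-language prompt using an open model via the transformers library; the CodeGen checkpoint and the prompt are assumptions, not the actual approach of AlphaCode or Codex:

```python
# A text-to-code sketch: completing a function from a natural-language comment.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Open code model from Salesforce (checkpoint choice is an assumption).
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

# Describe the desired code in text and let the model complete it.
prompt = "# Python function that returns the nth Fibonacci number\ndef fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```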
A few models work with several modalities for input or output. One example of multimodal input is OpenAI's GPT-4V model (GPT-4 with vision), released in September 2023, which accepts both text and images and offers better Optical Character Recognition (OCR) than previous versions for reading text from images. Images are first translated into descriptive words, and text filters are then applied; this mitigates the risk of generating unconstrained image captions.
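A minimal sketch of sending multimodal (text plus image) input through the OpenAI Python SDK follows; the model identifier and the image URL are assumptions, and the exact name of the vision-capable model may differ from what your account exposes:

```python
# Multimodal input sketch: asking a vision-capable model about an image.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What text appears in this image?"},
            # The URL below is a placeholder, not a real image.
            {"type": "image_url", "image_url": {"url": "https://example.com/sign.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```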
As the table shows, text is a common input modality that can be converted into various outputs, such as images, audio, and video; outputs can also be converted back into text or kept within the same modality. LLMs have driven rapid progress in text-focused domains, and these models enable a diverse range of capabilities across modalities and domains. The LLM categories are the main focus of this book; however, we'll also occasionally look at other models, text-to-image in particular. These models typically use a Transformer architecture trained on massive datasets via self-supervised learning.
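For a first taste of text-to-image, here is a short sketch using the diffusers library; the Stable Diffusion checkpoint, the prompt, and the availability of a CUDA GPU are assumptions:

```python
# Text-to-image sketch with Stable Diffusion via the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained checkpoint (name is an assumption); float16 needs a GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate an image from a text prompt and save it to disk.
image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```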
Underlying many of these innovations are advances in deep generative architectures such as GANs, diffusion models, and transformers. Labs at Google, OpenAI, Meta, and DeepMind are leading the way in innovation.