Why multimodality?
In the context of Generative AI, multimodality refers to a model's ability to process data in multiple formats, or modalities. For example, a multimodal model can communicate with humans via text, speech, images, or even videos, making the interaction feel remarkably smooth and "human-like."
In Chapter 1, we defined large foundation models (LFMs) as a type of pre-trained generative AI model that offers immense versatility, being adaptable to a wide range of specific tasks. LLMs, on the other hand, are a subset of foundation models that can process only one type of data: natural language. Even though LLMs have proven to be not only excellent at understanding and generating text but also reasoning engines that power applications and copilots, it soon became clear that we could aim for even more powerful applications.
The dream is to have intelligent systems that are capable of handling multiple data formats – text, images, audio, video, etc. – always...