Before we dive deeper, the first question to ask is: what do "multimodality" and "multimodal" mean?
A modality is a particular type of information, or the representation format in which that information is stored. For example, humans have various sensory modalities, such as light, sound, and pressure. In our case, the term refers more to how the data is acquired and stored. Commonly available modalities include natural language (spoken or written), visual information (from images or videos), audio (including voice, sounds, and music), Light Detection and Ranging (LIDAR) data, depth images, infrared images, functional MRI, physiological signals such as electrocardiograms (ECG), and so on.
Leveraging multiple data sources to build real-world applications is more a necessity than a choice, as in reality, dealing with multimodal...