To make use of multimodal information, we face several core challenges, including representation, translation, alignment, fusion, and co-learning (non-exclusive). This section briefly discusses each of them.
Challenges of multimodality learning
Representation
Representation refers to a computer-interpretable description of multimodal data (for example, a vector or a tensor). It covers, but is not limited to, the following:
- How to handle different symbols and signals—for example, in machine translation, Chinese characters and English characters are two distinct linguistic systems; in a self-driving system, point clouds from LIDAR sensors and image pixels from the RGB camera are two distinct sources with...
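To make "computer-interpretable description" concrete, here is a minimal sketch of how the inputs mentioned above might be laid out as vectors and tensors. Everything in it is a made-up illustration: the toy vocabularies, the `encode` and `shape` helpers, and the sample values are assumptions for this example, not part of any particular library or dataset.

```python
# Illustrative sketch only: all values below are toy data.

def shape(nested):
    """Return the shape of a regular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

# Text: each language gets its own (hypothetical) vocabulary mapping symbols to ids.
en_vocab = {"<unk>": 0, "hello": 1, "world": 2}
zh_vocab = {"<unk>": 0, "你": 1, "好": 2, "世": 3, "界": 4}

def encode(tokens, vocab):
    """Map tokens to integer ids, falling back to the <unk> id for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

en_ids = encode(["hello", "world"], en_vocab)  # word-level tokens
zh_ids = encode(list("你好世界"), zh_vocab)     # character-level tokens

# RGB image: a height x width x 3 grid of channel intensities (0..255).
image = [[[0, 0, 0] for _ in range(4)] for _ in range(2)]

# LiDAR point cloud: an unordered set of N points, each an (x, y, z) coordinate.
point_cloud = [[1.0, 0.5, -0.2], [2.3, -1.1, 0.0], [0.4, 0.4, 1.7]]

print(en_ids, zh_ids)
print(shape(image), shape(point_cloud))
```

Even in this toy setting the modalities differ in structure: text becomes a variable-length sequence of discrete ids, an image is a dense regular grid, and a point cloud is an unordered list of coordinates. A representation has to reconcile exactly this kind of mismatch.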