With the exciting progress of deep learning in different areas such as computer vision, natural language processing (NLP), robotics, and so on, there is an emerging trend to leverage multiple data sources to develop more powerful applications. This is called multimodality, which is a way of unifying different information sources, including images, text, speech, and so on.
In this chapter, we will discuss some fundamental progress in multimodality of using deep learning and share with you some advanced novel applications.