In this section, we will learn about another interesting variant of BERT called VideoBERT. As the name suggests, along with learning the representation of language, VideoBERT also learns the representation of video. It was the first model to learn representations of both video and language jointly.
Just as we used a pre-trained BERT model and fine-tuned it for downstream tasks, we can use a pre-trained VideoBERT model and fine-tune it for many interesting downstream tasks, such as image caption generation, video captioning, and predicting the next frames of a video.
But how exactly is VideoBERT pre-trained to learn video and language representations? Let's find out in the next section.
Pre-training a VideoBERT model
We know that the BERT model is pre-trained using two tasks: masked language modeling (the cloze task) and next sentence prediction. Can we also...