We started the chapter by learning how VideoBERT works. We learned how VideoBERT is pre-trained by predicting masked language and visual tokens. We also learned that VideoBERT's final pre-training objective function is a weighted combination of the text-only, video-only, and text-video training objectives. Later, we explored different applications of VideoBERT.
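To recap how that weighted objective combines the three losses, here is a toy sketch; the loss values and weights below are placeholders chosen for illustration, not VideoBERT's actual hyperparameters:

```python
# Toy illustration of a weighted pre-training objective in the style of VideoBERT.
# The individual loss values and weights are placeholders, not the real ones.
loss_text_only = 1.2    # masked language modeling loss on linguistic tokens
loss_video_only = 0.9   # masked modeling loss on visual tokens
loss_text_video = 1.5   # loss on the combined linguistic-visual token sequence

w_text, w_video, w_text_video = 0.3, 0.3, 0.4  # assumed weights

final_loss = (w_text * loss_text_only
              + w_video * loss_video_only
              + w_text_video * loss_text_video)
print(final_loss)
```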
Then, we learned that BART is essentially a transformer model with an encoder and a decoder. We feed corrupted text to the encoder; the encoder learns a representation of the given text and sends that representation to the decoder. The decoder takes the representation produced by the encoder and reconstructs the original, uncorrupted text. We also saw that BART uses a bidirectional encoder and a unidirectional (autoregressive) decoder.
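To make that encode-then-reconstruct behavior concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the pre-trained facebook/bart-base checkpoint (not necessarily the exact setup used earlier in the chapter). We corrupt a sentence with a <mask> token, feed it to the encoder, and let the decoder generate a reconstruction:

```python
# A minimal sketch: BART reconstructing corrupted (masked) text.
# Assumes the Hugging Face transformers library and the facebook/bart-base checkpoint.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Corrupted input: a span of tokens replaced by a single <mask> token
corrupted = "The weather was <mask>, so we decided to stay indoors."

inputs = tokenizer(corrupted, return_tensors="pt")

# The bidirectional encoder reads the corrupted text; the autoregressive
# decoder generates the reconstructed sentence token by token
generated_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=30)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```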
We also explored the different noising techniques BART uses, such as token masking, token deletion, token infilling, sentence shuffling, and document rotation, in detail. Finally, we learned how to perform text summarization with a pre-trained BART model.
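As a quick pointer to how that summarization step can look in code, the following is a minimal sketch, assuming the transformers summarization pipeline with the facebook/bart-large-cnn checkpoint; the input text is a made-up example rather than the one used in the chapter:

```python
# A minimal sketch of abstractive text summarization with a BART checkpoint
# fine-tuned on CNN/DailyMail; the input text below is a made-up example.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = (
    "Transformers have become the dominant architecture in natural language "
    "processing. Models such as BERT and BART are pre-trained on large corpora "
    "and then fine-tuned on downstream tasks such as classification, question "
    "answering, and summarization."
)

summary = summarizer(text, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```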