Working with AR language models
The Transformer architecture was originally designed for Seq2Seq tasks such as MT or summarization, but it has since been applied to a wide range of NLP problems, from token classification to coreference resolution. Subsequent works began to use the left (encoder) and right (decoder) parts of the architecture separately and more creatively. The AE objective, also known as the denoising objective, is to fully recover the original input from a corrupted version in a bidirectional fashion, as shown on the left side of Figure 4.1, which you will see shortly. As seen in the Bidirectional Encoder Representations from Transformers (BERT) architecture, a notable example of AE models, they can incorporate the context on both sides of a word.
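To make the denoising objective concrete, the following is a minimal sketch that corrupts a sentence with the [MASK] token and asks a pre-trained BERT model to reconstruct the missing word from both left and right context. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; both are illustrative choices rather than requirements of the objective itself:

```python
# A minimal sketch of the denoising (masked language modeling) objective,
# assuming the Hugging Face transformers library and bert-base-uncased.
from transformers import pipeline

# Load a fill-mask pipeline; BERT predicts the token hidden behind [MASK]
# using context from both sides of the masked position.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The "corrupted" input: one token has been replaced by the [MASK] symbol.
corrupted = "The cat sat on the [MASK] and purred."

for prediction in fill_mask(corrupted, top_k=3):
    # Each prediction holds the reconstructed sentence and its score.
    print(f"{prediction['sequence']}  (score={prediction['score']:.3f})")
```

Note that the [MASK] symbol in this example exists only because we injected it; this is exactly the artificial corruption that the next paragraph identifies as a source of trouble.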
However, the first issue is that the corrupting [MASK] symbols used during the pre-training phase are absent from the data during the fine-tuning phase, which leads to a pre-training/fine-tuning discrepancy. Secondly, the BERT...