Building large-scale language models by leveraging unlabeled data
In this section, we will discuss popular large-scale transformer models that emerged from the original transformer. One common theme among these transformers is that they are pre-trained on very large, unlabeled datasets and then fine-tuned for their respective target tasks. First, we will introduce the common training procedure for transformer-based models and explain how it differs from that of the original transformer. Then, we will focus on popular large-scale language models including Generative Pre-trained Transformer (GPT), Bidirectional Encoder Representations from Transformers (BERT), and Bidirectional and Auto-Regressive Transformers (BART).
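To make this pre-train-then-fine-tune theme more concrete before we go into the details, the following is a minimal sketch of the fine-tuning step only, assuming the Hugging Face transformers library, the distilbert-base-uncased checkpoint, and a hypothetical two-class sentiment task (the texts and labels are illustrative placeholders, not a real dataset):

# Minimal fine-tuning sketch (assumptions: Hugging Face transformers,
# distilbert-base-uncased checkpoint, hypothetical binary sentiment labels)
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a model whose weights were already pre-trained on large unlabeled corpora
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # 2 classes for the target task
)

# A tiny labeled batch standing in for the supervised target task
texts = ["A great movie!", "A boring movie."]
labels = torch.tensor([1, 0])
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()

# One fine-tuning step: the pre-trained weights are updated on labeled data
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

In practice, this single gradient step would sit inside a training loop over a labeled dataset; the key point is that the weights learned from unlabeled text during pre-training serve as the starting point for the supervised target task.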
Pre-training and fine-tuning transformer models
In an earlier section, Attention is all we need: introducing the original transformer architecture, we discussed how the original transformer can be used for language translation. Language translation is...