The rise of billion-parameter transformer models
The speed at which transformers went from small models trained for specific NLP tasks to billion-parameter models that require little to no fine-tuning is staggering.
Vaswani et al. (2017) introduced the Transformer, which outperformed CNNs and RNNs on machine translation tasks as measured by BLEU scores. Radford et al. (2018) introduced the Generative Pre-Training (GPT) model, which could perform downstream tasks after fine-tuning. Devlin et al. (2019) perfected fine-tuning with the BERT model. Radford et al. (2019) went further with GPT-2 models, showing that a large language model can perform downstream tasks without explicit supervised training.
Brown et al. (2020) then defined a zero-shot approach with GPT-3 that requires no fine-tuning at all!
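To make the contrast with fine-tuning concrete, here is a minimal sketch of zero-shot prompting, assuming the Hugging Face transformers library; since GPT-3 is only available through OpenAI's API, the small public gpt2 checkpoint stands in for illustration, and the prompt is an invented example rather than the paper's setup. The model receives the task description as plain text and answers without any gradient updates:

# Zero-shot sketch: a pre-trained language model is prompted with a task
# description and produces an answer with no fine-tuning.
# Assumption: gpt2 stands in for GPT-3, which is only reachable via OpenAI's API.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Translate English to French:\ncheese =>"
result = generator(prompt, max_new_tokens=5, num_return_sequences=1)
print(result[0]["generated_text"])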
In parallel with this model evolution, Wang et al. (2019) created GLUE to benchmark natural language understanding (NLU) models. But transformer models evolved so quickly that they surpassed the human baselines!
Wang et al. (2019, 2020) quickly responded with SuperGLUE, which set the human baselines much higher and made the NLU tasks more challenging. Transformers are progressing rapidly up the SuperGLUE leaderboards...
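To get a sense of what these harder tasks look like, here is a minimal sketch, assuming the Hugging Face datasets library exposes SuperGLUE under the super_glue identifier; it loads one SuperGLUE task (BoolQ, yes/no question answering over a passage) and prints a validation example:

# Load a single SuperGLUE task and inspect one example.
# Assumption: the benchmark is available as "super_glue" via the datasets library.
from datasets import load_dataset

boolq = load_dataset("super_glue", "boolq", split="validation")
sample = boolq[0]
print(sample["question"])         # a yes/no question
print(sample["passage"][:200])    # the supporting passage (truncated)
print("label:", sample["label"])  # 0 = False, 1 = True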