Other autoencoding models
In this part, we will review autoencoding model alternatives that slightly modify the original BERT. These alternative re-implementations aim to achieve better downstream task performance by exploiting several avenues: optimizing the pretraining process, tuning the number of layers or attention heads, improving data quality, designing better objective functions, and so forth. The sources of improvement roughly fall into two categories: better architectural design choices and pretraining control.
Many effective alternatives have been shared lately, making it impossible to cover and explain them all here. Instead, we will take a look at some of the most cited models in the literature and the most widely used ones on NLP benchmarks. Let's start with A Lite BERT (ALBERT), a re-implementation of BERT that focuses especially on architectural design choices.
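Before turning to the design details, the following minimal sketch shows one way a pretrained ALBERT checkpoint could be loaded and run with the Hugging Face transformers library; the choice of the publicly released albert-base-v2 checkpoint and the PyTorch backend are assumptions for illustration, not requirements of the model itself:

# Minimal sketch: load a pretrained ALBERT checkpoint with Hugging Face
# transformers (assumes transformers and PyTorch are installed;
# "albert-base-v2" is a publicly released checkpoint).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModel.from_pretrained("albert-base-v2")

# Encode a sample sentence and run it through the model
inputs = tokenizer("ALBERT is a lite version of BERT.", return_tensors="pt")
outputs = model(**inputs)

# Contextual embeddings for each token:
# shape is (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)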
Introducing ALBERT
The performance of language models is generally considered to improve as their size increases. However, training such models is...