BERT – one of the autoencoding language models
Bidirectional Encoder Representations from Transformers, also known as BERT, was one of the first autoencoding language models to utilize the encoder Transformer stack with slight modifications for language modeling.
The BERT architecture is a multilayer Transformer encoder based on the original Transformer implementation. The Transformer model itself was originally designed for machine translation tasks, and the main improvement made by BERT is the use of the encoder part of that architecture for better language modeling. After pretraining, this language model is able to provide a global understanding of the language it is trained on.
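As a minimal sketch of this encoder-only design, the following snippet (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, which are not part of the original text) shows that a pretrained BERT model is simply a stack of Transformer encoder layers producing one contextual vector per input token:

```python
# A minimal sketch, assuming the Hugging Face `transformers` library
# and the publicly available `bert-base-uncased` checkpoint.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT encodes text bidirectionally.", return_tensors="pt")
outputs = model(**inputs)

# The encoder stack returns one contextual vector per token:
# shape is (batch_size, sequence_length, hidden_size=768 for bert-base)
print(outputs.last_hidden_state.shape)

# bert-base uses 12 Transformer encoder layers
print(model.config.num_hidden_layers)
```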
BERT language model pretraining tasks
To have a clear understanding of the masked language modeling used by BERT, let's define it in more detail. Masked language modeling is the task of training a model on an input sentence in which some tokens have been masked, and having it predict the missing tokens so that the output is the whole sentence reconstructed.
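As an illustration, the short sketch below (assuming the Hugging Face transformers fill-mask pipeline and the bert-base-uncased checkpoint, neither of which is specified in the original text) shows a pretrained BERT model predicting a masked token from its bidirectional context:

```python
# A minimal sketch of masked language modeling, assuming the
# Hugging Face `transformers` fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model sees a sentence with one masked token and predicts
# the most likely words for the [MASK] position.
predictions = fill_mask("The capital of France is [MASK].")
for pred in predictions:
    print(pred["token_str"], round(pred["score"], 3))
# The top candidate is expected to be "paris".
```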