An overview of popular and well-known models
Since the publication of the seminal paper Attention Is All You Need, a large number of alternative transformer-based models have been proposed. Let’s review some of the most popular and well-known ones.
BERT
BERT, or Bidirectional Encoder Representations from Transformers, is a language representation model developed by the Google AI research team in 2018. Let’s go over the main intuition behind that model:
- BERT considers the context of each word from both its left and its right side, using so-called “bidirectional self-attention.”
- Training happens by randomly masking a fraction of the input word tokens; masking prevents cycles in which a word could indirectly “see itself” through the bidirectional context. In NLP jargon, this is a “fill in the blank” (cloze) task, formally known as masked language modeling. In other words, the pretraining task involves masking a small subset of unlabeled inputs and then training the network to recover these original inputs. (This is an example of self-supervised learning: the training targets are derived automatically from the unlabeled text, with no human annotation required.) The code sketch below illustrates the idea.
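To make the fill-in-the-blank behavior concrete, here is a minimal sketch that assumes the Hugging Face transformers library is installed and uses the publicly released bert-base-uncased checkpoint (both are assumptions of this sketch, not tools introduced elsewhere in this section). A pretrained BERT model predicts the masked token from its left and right context:

```python
# A minimal sketch of BERT's "fill in the blank" behavior.
# Assumes the Hugging Face transformers library is installed and that the
# public bert-base-uncased checkpoint can be downloaded.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# [MASK] stands in for the word to recover; BERT uses context from both the
# left ("The capital of France is") and the right (".") to predict it.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(f"{prediction['token_str']:>10s}  score={prediction['score']:.3f}")
```

Running this prints the most likely replacements for the masked position together with their scores; a sensible completion such as “paris” would typically rank near the top.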