Understanding BERT
In this section, we'll explore the most influential and commonly used Transformer model: BERT. BERT was introduced in Google's research paper: https://arxiv.org/pdf/1810.04805.pdf.
What does BERT do exactly? To understand what BERT outputs, let's dissect the name:
- Bidirectional: Training on the text data is bidirectional, meaning that each word's representation is conditioned on the words both to its left and to its right in the input sentence, rather than on one direction only.
- Encoder: An encoder maps the input sentence to a sequence of vectors.
- Representations: A representation is a word vector; the model produces one for each word in the input.
- Transformers: The architecture is transformer-based.
BERT is essentially a trained transformer encoder stack. The input to BERT is a sentence, and the output is a sequence of word vectors. These word vectors are contextual, which means the vector assigned to a word depends on the sentence it appears in. In short, BERT outputs contextual word representations.
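To make this concrete, here is a minimal sketch of retrieving contextual word vectors from a pretrained BERT model. It assumes the Hugging Face transformers and torch packages are installed and uses the bert-base-uncased checkpoint; the example sentences and the helper function bank_vector are illustrative choices, not part of the original text. The word "bank" appears in two different contexts, and the two vectors BERT assigns to it differ accordingly.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load a pretrained BERT encoder stack and its tokenizer (assumed checkpoint).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "He deposited the check at the bank.",
    "They had a picnic on the bank of the river.",
]

def bank_vector(sentence):
    """Return the contextual vector BERT assigns to the token 'bank'."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)  # last_hidden_state has shape (1, seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index("bank")     # position of 'bank' in the tokenized input
    return outputs.last_hidden_state[0, idx]

vec_finance, vec_river = (bank_vector(s) for s in sentences)
similarity = torch.cosine_similarity(vec_finance, vec_river, dim=0)
print(f"Cosine similarity between the two 'bank' vectors: {similarity.item():.3f}")
```

The printed similarity is typically noticeably below 1.0: even though the surface word is identical, each vector reflects the surrounding sentence, which is exactly what "contextual word representations" means.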
We have already seen a number of issues that...