Understanding BERT
BERT was published in 2019 by Devlin et al. and builds on the Transformer architecture [3]. It soon became the prevailing model in NLP. Rather than reading text in only one direction, BERT learns from the words both before and after each word, so it captures context and word order more fully. This helps it handle tricky cases such as jokes or words with multiple meanings, making it effective at understanding many kinds of text, from chat messages to books.
How does it do that? BERT removes the unidirectionality constraint of earlier Transformer-based language models by using a masked language model (MLM) objective that randomly masks some of the input tokens. Because those tokens are hidden, the model has to predict the original vocabulary id of each masked word from the surrounding tokens. In doing so it conditions jointly on both the left and right context, which is why the model is called bidirectional.
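To make the MLM objective concrete, here is a minimal sketch in Python of the masking step applied to a tokenized sentence, assuming the 80/10/10 replacement rule described in the BERT paper; the function name `mask_tokens` and the toy vocabulary are illustrative, not part of any library.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly mask tokens for MLM training, following BERT's 80/10/10 rule."""
    masked = list(tokens)
    labels = [None] * len(tokens)          # None = position not predicted
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:    # select ~15% of positions
            labels[i] = tok                # model must recover the original token
            r = random.random()
            if r < 0.8:                    # 80%: replace with [MASK]
                masked[i] = mask_token
            elif r < 0.9:                  # 10%: replace with a random vocabulary token
                masked[i] = random.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return masked, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(mask_tokens(tokens, vocab))
```

During pre-training, the loss is computed only at positions whose label is not None, so the model learns to reconstruct roughly 15% of the tokens in each sequence from their bidirectional context.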