We began this chapter by understanding the basic idea of BERT. We learned that BERT can understand the contextual meaning of words and generate embeddings according to context, unlike context-free models such as word2vec, which generate embeddings irrespective of the context.
Next, we looked into the workings of BERT. We understood that Bidirectional Encoder Representations from Transformers (BERT), as the name suggests, is essentially a stack of transformer encoder layers.
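To make the idea of contextual embeddings concrete, the following sketch uses the Hugging Face transformers library (not part of this chapter itself, so treat the model name, the example sentences, and the helper function as assumptions) to embed the word bank in two different sentences and compare the two representations:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def embedding_of(sentence, word):
    # Tokenize the sentence and locate the target word's position
    inputs = tokenizer(sentence, return_tensors='pt')
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    idx = tokens.index(word)
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape [batch, seq_len, hidden_size]
    return outputs.last_hidden_state[0, idx]

emb_money = embedding_of("he deposited cash at the bank", "bank")
emb_river = embedding_of("he sat on the bank of the river", "bank")

# The same word receives different embeddings in different contexts
print(torch.cosine_similarity(emb_money, emb_river, dim=0).item())
```

Because the encoder attends to the whole sentence, the two embeddings of bank differ, which is exactly the context-dependence that a context-free model such as word2vec lacks.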
Following this, we looked into the different configurations of BERT. We learned that BERT-base consists of 12 encoder layers, 12 attention heads, and 768 hidden units, while BERT-large consists of 24 encoder layers, 16 attention heads, and 1,024 hidden units.
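For reference, these two configurations map directly onto a handful of hyperparameters. The sketch below expresses them with BertConfig from the Hugging Face transformers library; the library itself is an assumption here, but the layer, head, and hidden-size values are the ones listed above.

```python
from transformers import BertConfig

# BERT-base: 12 encoder layers, 12 attention heads, 768 hidden units
bert_base = BertConfig(
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,
)

# BERT-large: 24 encoder layers, 16 attention heads, 1,024 hidden units
bert_large = BertConfig(
    num_hidden_layers=24,
    num_attention_heads=16,
    hidden_size=1024,
)

print(bert_base.num_hidden_layers, bert_large.num_hidden_layers)  # 12 24
```

Note that in both configurations the hidden size divides evenly by the number of attention heads (768/12 = 1,024/16 = 64), so each head works on a 64-dimensional slice of the hidden state.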
Moving on, we learned how the BERT model is pre-trained using two interesting tasks, called masked language modeling (MLM) and next sentence prediction (NSP). We learned that in masked language modeling, we mask 15% of the tokens and train BERT to predict the masked tokens, while in the NSP task, we train BERT to predict whether the second sentence is the follow-up (next) sentence of the first.
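To make the masked language modeling objective concrete, here is a small sketch, again assuming the Hugging Face transformers library and a pre-trained bert-base-uncased checkpoint: a single token is replaced with [MASK] and the model is asked to recover it. During pre-training, 15% of the tokens in each sequence are masked this way and BERT is trained to predict them; this snippet only runs inference with an already pre-trained model.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# Mask one token and let the pre-trained model fill it in
text = "Paris is a beautiful [MASK]"
inputs = tokenizer(text, return_tensors='pt')
mask_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits  # [batch, seq_len, vocab_size]

# Pick the most likely token for the masked position
predicted_id = logits[0, mask_index].argmax().item()
print(tokenizer.decode([predicted_id]))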