Summary
In this chapter, we talked about Transformer models. First, we looked at the Transformer at a microscopic level to understand the inner workings of the model. We saw that Transformers use self-attention, a powerful technique that lets the model attend to the other tokens in a text sequence while processing a given token. We also saw that, in addition to token embeddings, Transformers use positional embeddings to inform the model of each token's position in the sequence. Finally, we discussed how Transformers leverage residual connections (that is, shortcut connections) and layer normalization to improve model training.
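As a rough recap of those pieces, here is a minimal sketch of a single encoder block, written in PyTorch purely for illustration (the chapter may use a different framework, and the class name and hyperparameters below are arbitrary). It combines token and positional embeddings, applies self-attention, and wraps both the attention and feed-forward sub-layers with residual connections and layer normalization:

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4, d_ff=512, max_len=512, vocab_size=30522):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positional embeddings
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)  # token + positional embeddings
        attn_out, _ = self.attn(x, x, x)        # self-attention over the whole sequence
        x = self.norm1(x + attn_out)            # residual (shortcut) connection + layer norm
        x = self.norm2(x + self.ff(x))          # residual (shortcut) connection + layer norm
        return x

block = EncoderBlock()
output = block(torch.randint(0, 30522, (1, 16)))  # shape: (batch=1, seq_len=16, d_model=128)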
We then discussed BERT, an encoder-based Transformer model. We looked at the format of the data BERT accepts and the special tokens it uses in its input. Next, we discussed four different types of tasks BERT can solve: sequence classification, token classification, multiple choice, and question answering.
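To make the input format concrete, the sketch below uses the Hugging Face transformers library (the chapter's own code may look different). The tokenizer automatically adds BERT's special tokens, a [CLS] token at the start and a [SEP] token after each sequence, and a sequence-classification head then predicts a label for the encoded pair:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Encode a sentence pair; [CLS] and [SEP] are inserted automatically by the tokenizer
inputs = tokenizer("The movie was great", "I would watch it again", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))

with torch.no_grad():
    logits = model(**inputs).logits  # one score per class, computed from the [CLS] position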
Finally, we looked at how BERT is pre-trained on a large corpus...
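Though the pre-training discussion is cut off above, BERT's standard pre-training objectives are masked language modeling and next-sentence prediction. As an illustration of the masked language modeling idea only (again using Hugging Face transformers, not necessarily the chapter's code), a pre-trained BERT can fill in a [MASK] token:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))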