Transformers for language modeling
Transformers were introduced in the influential paper Attention Is All You Need (Vaswani et al., 2017) as a new approach to sequence-to-sequence modeling tasks such as translating text from one language into another (that is, machine translation). These models are built on the idea of self-attention, which lets the model weigh how relevant every other element of the input sequence is when it processes a given element. This attention mechanism helps the model capture the relationships between the elements of input sequences – for example, between the words of a sentence in language modeling. Models built using transformers usually outperform those built with predecessor techniques such as Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) networks (Vaswani et al., 2017; Devlin et al., 2018).
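To make the idea of self-attention more concrete, the following minimal NumPy sketch computes scaled dot-product attention for a toy sequence. It is an illustration rather than the implementation from the original paper; the function name self_attention, the projection matrices W_q, W_k, and W_v, and the toy dimensions are assumptions chosen for readability, following the query/key/value notation of Vaswani et al. (2017).

```python
# A minimal sketch of scaled dot-product self-attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for a sequence X of shape (seq_len, d_model)."""
    # Project the input into queries, keys, and values.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # Each row of `scores` measures how strongly one token attends to every other token.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    # Return a context-aware representation of each token.
    return weights @ V

# Toy example: a "sentence" of 4 tokens, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q = rng.normal(size=(8, 8))
W_k = rng.normal(size=(8, 8))
W_v = rng.normal(size=(8, 8))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```

The attention weights form one row per token, summing to 1 across the sequence, which is the mechanism that lets every position draw information from every other position in a single step instead of passing it along the sequence as an RNN would.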
Figure 13.6 shows four traditional problems in language modeling that have...