Transformers
While we will discuss this topic in more detail in Chapter 9, NLP 2.0: Using Transformers to Generate Text, it is important to note that convolutional and recurrent units have been replaced in many current applications by Transformers, a type of architecture first described in 2017. In a way, transformers combine the strengths of both recurrent and convolutional networks. Like convolutional networks, they compute relationships between elements in a sequence or matrix; but unlike convolutional networks, which operate only locally, they perform this calculation between all pairs of elements. Like LSTMs, they preserve context, in their case through positional encodings, the all-to-all pairwise similarity calculation (known as self-attention), and pass-through connections that resemble the memory units in LSTMs. However, unlike LSTMs, these operations can be computed in parallel, enabling more efficient training.
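To make the all-to-all similarity calculation concrete, the following is a minimal NumPy sketch of scaled dot-product self-attention for a single attention head. It is an illustration rather than the book's own code: the weight matrices Wq, Wk, and Wv, the toy dimensions, and the random inputs are all assumptions chosen only to show the shape of the computation.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row maximum before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token embeddings, assumed to already include positional encodings
    Q = X @ Wq                            # queries
    K = X @ Wk                            # keys
    V = X @ Wv                            # values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # all-to-all pairwise similarity between positions
    weights = softmax(scores, axis=-1)    # each row is a distribution over the whole sequence
    return weights @ V                    # every output is a weighted mix of all value vectors

# Toy usage: a sequence of 4 tokens with model dimension 8 (illustrative values)
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)   # (4, 8): one updated representation per input position

Note that every row of the score matrix is produced independently of the others, which is why, unlike the step-by-step updates of an LSTM, the whole calculation can be carried out in parallel across positions.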
Figure 2.17 gives an overview of how this remarkable operation works; each element in a sequence...