Transformer architecture
Even though the transformer architecture is different from recurrent networks, it builds on many ideas that originated in them. It represents the next evolutionary step of deep learning architectures that work with text, and as such, should be an essential part of your toolbox. The transformer architecture is a variant of the Encoder-Decoder architecture, where the recurrent layers have been replaced with Attention layers. The transformer architecture was proposed by Vaswani et al. [30], along with a reference implementation, which we will refer to throughout this discussion.
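To make the idea of replacing recurrence with attention concrete, the following is a minimal NumPy sketch of the scaled dot-product attention described in [30], softmax(QK^T / sqrt(d_k))V, applied to decoder queries attending over encoder outputs. It is an illustration only, not the reference implementation; the array shapes and variable names are assumptions chosen for the example.

```python
import numpy as np

def scaled_dot_product_attention(query, key, value):
    """Compute softmax(Q K^T / sqrt(d_k)) V over the last two axes."""
    d_k = query.shape[-1]
    scores = np.matmul(query, key.transpose(0, 2, 1)) / np.sqrt(d_k)
    # Numerically stable softmax over the source positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.matmul(weights, value), weights

# Toy example (shapes are illustrative): batch 1, source length 4,
# target length 3, model dimension 8
np.random.seed(0)
encoder_outputs = np.random.randn(1, 4, 8)   # keys and values come from the encoder
decoder_states  = np.random.randn(1, 3, 8)   # queries come from the decoder

context, attn = scaled_dot_product_attention(decoder_states,
                                             encoder_outputs,
                                             encoder_outputs)
print(context.shape)  # (1, 3, 8) -- one context vector per target position
print(attn.shape)     # (1, 3, 4) -- one weight per source position
```

The same operation appears throughout the transformer: in the encoder and decoder as self-attention, and in the decoder as attention over the encoder outputs, which is the role played by the attention layer in the seq2seq model discussed next.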
Figure 7 shows a seq2seq network with attention and compares it to a transformer network. The transformer is similar to the seq2seq with Attention model in the following ways:
- Both source and target are sequences
- The output of the last block of the encoder is used as the context or thought vector for computing the Attention model on the decoder
- The target sequences...