Attention is all we need: introducing the original transformer architecture
Interestingly, the original transformer architecture is based on an attention mechanism that was first developed for RNNs. Originally, the intention behind using an attention mechanism was to improve the text generation capabilities of RNNs when working with long sentences. However, only a few years after experimenting with attention mechanisms for RNNs, researchers found that an attention-based language model was even more powerful when the recurrent layers were removed entirely. This led to the development of the transformer architecture, which is the main topic of this chapter and the remaining sections.
The transformer architecture was first proposed in the NeurIPS 2017 paper Attention Is All You Need by A. Vaswani and colleagues (https://arxiv.org/abs/1706.03762). Thanks to the self-attention mechanism, a transformer model can capture long-range dependencies among the elements in an input sequence—in an...