7
The Attention Mechanism and Transformers
In Chapter 6, we outlined a typical natural language processing (NLP) pipeline, and we introduced recurrent neural networks (RNNs) as a candidate architecture for NLP tasks. But we also outlined their drawbacks—they are inherently sequential (that is, not parallelizable), and they struggle with longer sequences because of the limitations of their fixed-size internal sequence representation. In this chapter, we’ll introduce the attention mechanism, which gives a neural network (NN) direct access to the whole input sequence. We’ll briefly discuss attention in the context of RNNs, since it was first introduced as an RNN extension. However, the star of this chapter will be the transformer—a recent NN architecture that relies entirely on attention. Transformers have been one of the most important NN innovations of the past 10 years. They are at the core of all recent large language models (LLMs), such as ChatGPT...
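To preview the core idea before we develop it formally later in the chapter, here is a minimal sketch of (scaled dot-product) attention in plain NumPy. The function name and toy dimensions are illustrative rather than taken from any particular library; the point is simply that each output position is a weighted sum over all input positions, so the network looks at the entire sequence directly instead of compressing it into a single recurrent state.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Minimal attention sketch: every query attends to every input position.

    q: (seq_len_q, d_k) query vectors
    k: (seq_len_kv, d_k) key vectors
    v: (seq_len_kv, d_v) value vectors
    """
    d_k = q.shape[-1]
    # Similarity of each query with every key -> (seq_len_q, seq_len_kv)
    scores = q @ k.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted sum over *all* value vectors
    return weights @ v

# Toy self-attention example: a sequence of 5 tokens with 4-dimensional embeddings
x = np.random.randn(5, 4)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (5, 4)
```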