Getting Started with the Architecture of the Transformer Model
Language is the essence of human communication. Civilizations would never have been born without the word sequences that form language. We now mostly live in a world of digital representations of language. Our daily lives rely on digital language functions powered by natural language processing (NLP): web search engines, emails, social networks, posts, tweets, smartphone texting, translations, web pages, speech-to-text on streaming sites for transcripts, text-to-speech on hotline services, and many more everyday functions.
In 2017, Google Brain and Google Research published the seminal paper Attention Is All You Need (Vaswani et al.). The Transformer was born. On machine translation benchmarks, it outperformed the existing state-of-the-art NLP models, trained faster than previous architectures, and obtained higher evaluation results. As a result, transformers have become a key component of NLP.
Since 2017, transformer models such as OpenAI’s ChatGPT and GPT-4, Google’s PaLM and LaMDA, and other Large Language Models (LLMs) have emerged. However, this is just the beginning! To take part in this new era of LLMs as an AI expert, you need to understand how attention heads work.
The idea behind the Transformer’s attention heads is to do away with recurrence: attention replaces the recurrent neural network layers that earlier sequence models relied on. In this chapter, we will open the hood of the original Transformer model described by Vaswani et al. (2017) and examine the main components of its architecture. Then, we will explore the fascinating world of attention and illustrate the key components of the Transformer.
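To make the contrast with recurrence concrete before going into detail, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration under simplifying assumptions, not the full model from the paper: the learned query, key, and value projections are omitted (the same toy vectors are used for all three), there is a single head, and the function names and input values are made up for this example.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence.

    Each argument is a list of equal-length vectors, one per token.
    No hidden state is carried from one token to the next, so the
    output rows could be computed in any order (or in parallel).
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Compare this token's query with every token's key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Toy input: 3 tokens with 2 dimensions each (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(x, x, x)
for row in out:
    print([round(v, 3) for v in row])
```

The loop over queries is only there for readability. Unlike a recurrent network, no iteration depends on the result of the previous one, which is what allows this computation to be expressed as matrix operations over the whole sequence at once.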
This chapter covers the following topics:
- The architecture of the Transformer
- The Transformer’s self-attention model
- The encoding and decoding stacks
- Input and output embedding
- Positional embedding
- Self-attention
- Multi-head attention
- Masked multi-head attention
- Residual connections
- Normalization
- Feedforward network
- Output probabilities
With all the innovations and library updates in this cutting-edge field, packages and models change regularly. Please go to the GitHub repository for the latest installation and code examples: https://github.com/Denis2054/Transformers-for-NLP-and-Computer-Vision-3rd-Edition/tree/main/Chapter02.
You can also post a message in our Discord community (https://www.packt.link/Transformers) if you have any trouble running the code in this or any chapter.
Let’s dive directly into the structure of the original Transformer’s architecture.