Summary
LLMs are very large transformers with various modifications that accommodate their size. In this chapter, we discussed these modifications, as well as the qualitative differences between LLMs and regular transformers. First, we focused on their architecture, including more efficient attention mechanisms, such as sparse attention, and prefix decoders. We also discussed the nuts and bolts of the LLM architecture. Next, we surveyed the latest LLM architectures, with special attention given to the GPT and LLaMA series of models. Then, we discussed LLM training, including training datasets, the Adam optimization algorithm, and various performance improvements. We also discussed the RLHF technique and the emergent abilities of LLMs. Finally, we introduced the Hugging Face Transformers library.
In the next chapter, we’ll discuss transformers for computer vision (CV) and multimodal transformers, and we’ll continue our introduction to the Transformers library.