Chapter 5: Splitting the Model
In this chapter, we will discuss how to train giant models with model parallelism. Giant models are models that are too large to fit into a single GPU's memory. Examples include Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT) family, such as GPT-2 and GPT-3.
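To make the idea concrete, here is a minimal PyTorch sketch of model parallelism: a toy two-stage network whose layers are placed on two different GPUs, so that the full set of parameters never has to fit on a single device. The layer sizes and device names are illustrative assumptions, and the snippet assumes a machine with at least two CUDA devices.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """A toy model split across two GPUs (illustrative sizes)."""
    def __init__(self):
        super().__init__()
        # Each stage lives on its own device, so neither GPU holds the whole model.
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        # Move the intermediate activation to the second GPU before stage 2.
        x = self.stage2(x.to("cuda:1"))
        return x

model = TwoStageModel()
out = model(torch.randn(8, 1024))
print(out.shape, out.device)  # torch.Size([8, 1024]) cuda:1
```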
In contrast to data parallel workloads, model parallelism is most often adopted for language models. Language models are deep learning models used in the Natural Language Processing (NLP) domain: the input is usually a text sequence, and the model outputs predictions for tasks such as question answering and next-sentence prediction.
NLP model training is often divided into two types, pre-training and fine-tuning. Pre-training trains the whole giant model from scratch, which may require a huge amount of data and many training epochs. Fine-tuning uses the pre-trained model as a starting point and adapts it to a specific downstream task, typically with a much smaller dataset and fewer training epochs.
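As a rough sketch of the fine-tuning workflow, the snippet below loads a pre-trained BERT checkpoint and runs a single fine-tuning step on one labeled example. It uses the Hugging Face transformers library, which is an illustrative assumption here rather than something this chapter prescribes.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load a pre-trained checkpoint instead of training from scratch.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One labeled example for a downstream classification task (illustrative).
inputs = tokenizer("Model parallelism splits a giant model across GPUs.",
                   return_tensors="pt")
labels = torch.tensor([1])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**inputs, labels=labels)  # forward pass returns the loss
outputs.loss.backward()                   # backward pass
optimizer.step()                          # update the pre-trained weights
optimizer.zero_grad()
print(float(outputs.loss))
```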