Chapter 6: Pipeline Input and Layer Split
In this chapter, we continue our discussion of model parallelism. Compared to data parallelism, model-parallel training typically requires more GPUs/accelerators, so system efficiency plays an especially important role in both model-parallel training and inference.
We limit our discussion to settings where the following assumptions hold:
- We assume all input data batches are of the same size.
- We assume multi-layer perceptrons (MLPs) can be computed with general matrix multiply (GEMM) functions (a short GEMM sketch follows this list).
- Each NLP job runs exclusively on its own set of accelerators (for example, GPUs). This means there is no interference from other jobs.
- Each NLP job uses a single type of accelerator (for example, GPUs).
- GPUs within a machine are connected with homogeneous links (for example, NVLink or PCIe).
- For cross-machine communication, the machines are also connected with homogeneous links (for example, an Ethernet cable).
- For...
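To make the GEMM assumption concrete, here is a minimal NumPy sketch of an MLP forward pass expressed as two GEMM calls plus an element-wise activation. The function name, layer sizes, and batch size are illustrative assumptions, not values from this chapter:

```python
import numpy as np

# Minimal sketch (illustrative assumption): a two-layer MLP whose forward
# pass consists of two GEMMs plus an element-wise ReLU.
def mlp_forward(x, W1, b1, W2, b2):
    h = x @ W1 + b1          # GEMM 1: (batch, d_in) x (d_in, d_hidden)
    h = np.maximum(h, 0.0)   # element-wise ReLU (not a GEMM)
    y = h @ W2 + b2          # GEMM 2: (batch, d_hidden) x (d_hidden, d_out)
    return y

# Example usage with a fixed batch size, matching the equal-batch-size assumption.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 128))                        # batch of 32 inputs
W1, b1 = rng.standard_normal((128, 256)), np.zeros(256)   # hidden layer weights
W2, b2 = rng.standard_normal((256, 64)), np.zeros(64)     # output layer weights
print(mlp_forward(x, W1, b1, W2, b2).shape)               # (32, 64)
```

Because each layer reduces to a GEMM, its cost and memory footprint can be estimated directly from the matrix shapes, which is what makes the pipeline and layer-split analysis in this chapter tractable.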