Case study of Megatron-LM
Megatron-LM is a large-scale DNN training system developed at NVIDIA. It combines data parallelism and model parallelism in a single system.
Let's first talk about how Megatron-LM splits models using model parallelism. Then, we will discuss how that scheme is extended with data parallelism.
Layer split for model parallelism
We will first illustrate how Megatron-LM uses model parallelism within a single multi-GPU machine. Let's focus on a simple matrix multiplication case, since General Matrix Multiply (GEMM) is widely used in the DNN layers of language models.
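To make the setup concrete, here is a minimal sketch of the single GEMM performed by a fully connected layer. The NumPy form, the shapes, and the names X, A, and Y are illustrative assumptions, not code from Megatron-LM:

```python
import numpy as np

# Illustrative shapes only: a batch of 8 input rows of width 4,
# multiplied by a 4x4 weight matrix -- one GEMM, Y = XA.
X = np.random.rand(8, 4).astype(np.float32)  # layer input (assumed name)
A = np.random.rand(4, 4).astype(np.float32)  # layer weight matrix
Y = X @ A                                    # General Matrix Multiply
assert Y.shape == (8, 4)
```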
Suppose we have a weight matrix for one particular layer of a language model, as shown in the following diagram:
As shown in the preceding diagram, we call this weight matrix A; it is a 4x4 matrix.
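One way Megatron-LM partitions such a weight matrix across GPUs is column-wise (it uses row-wise splits as well). The following is a hedged sketch of that idea; the placeholder values and the two-worker setup are assumptions for illustration, not Megatron-LM's actual code:

```python
import numpy as np

# A hypothetical 4x4 weight matrix A (placeholder values).
A = np.arange(16, dtype=np.float32).reshape(4, 4)

# Column-wise split across two workers: A = [A1 | A2],
# so each worker holds a 4x2 shard of the weights.
A1, A2 = np.hsplit(A, 2)

# Concatenating the shards recovers the full weight matrix, which is
# why each worker can compute its own slice of the GEMM output.
assert np.array_equal(np.hstack([A1, A2]), A)
```

Splitting column-wise means each GPU can compute its slice of the layer output independently from the same input, which is why this kind of partitioning pairs naturally with GEMM-heavy layers.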
Now, let's assume we have some input data for this DNN layer. We call the input data...