Vanilla model parallelism is inefficient
As reported widely in academic papers and industry technical reports, vanilla model parallelism is inefficient in terms of both GPU computation and memory utilization. To illustrate why vanilla model parallelism is inefficient, let's look at a simple DNN model, shown in Figure 6.1:
As shown in Figure 6.1, we pass the training input into our three-layer NLP model, with the layers denoted as Layer 1, Layer 2, and Layer 3. After forward propagation, the model generates the output.
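As a rough reference, the following is a minimal sketch of such a three-layer model in PyTorch; the layer type (plain Linear layers), dimensions, and activation function are assumptions made purely for illustration, not details taken from Figure 6.1:

import torch
import torch.nn as nn

class ThreeLayerModel(nn.Module):
    # Illustrative three-layer model; sizes are hypothetical.
    def __init__(self, in_dim=128, hidden_dim=128, out_dim=10):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, hidden_dim)      # Layer 1
        self.layer2 = nn.Linear(hidden_dim, hidden_dim)  # Layer 2
        self.layer3 = nn.Linear(hidden_dim, out_dim)     # Layer 3

    def forward(self, x):
        # Forward propagation: input flows through Layer 1 -> Layer 2 -> Layer 3
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        return self.layer3(x)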
Now let's assume we use three GPUs, with each GPU holding only one layer of the original model, as shown in Figure 6.2:
In Figure 6.2, GPU1 holds Layer 1 of the model, GPU2 holds Layer 2, and GPU3 holds Layer 3.
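The following is a minimal sketch of this placement, assuming three CUDA devices (cuda:0, cuda:1, cuda:2) are available; again, the layer types and sizes are illustrative assumptions. Note how activations are copied from one GPU to the next, so only one GPU computes at a time while the other two sit idle:

import torch
import torch.nn as nn

class VanillaModelParallel(nn.Module):
    # Each layer lives on its own GPU, mirroring Figure 6.2.
    def __init__(self, in_dim=128, hidden_dim=128, out_dim=10):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, hidden_dim).to("cuda:0")      # GPU1 holds Layer 1
        self.layer2 = nn.Linear(hidden_dim, hidden_dim).to("cuda:1")  # GPU2 holds Layer 2
        self.layer3 = nn.Linear(hidden_dim, out_dim).to("cuda:2")     # GPU3 holds Layer 3

    def forward(self, x):
        # Activations move GPU to GPU; while one GPU computes,
        # the other two are idle.
        x = torch.relu(self.layer1(x.to("cuda:0")))
        x = torch.relu(self.layer2(x.to("cuda:1")))
        return self.layer3(x.to("cuda:2"))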
Now, we...