Case study of Mesh-TensorFlow
Having discussed Megatron-LM in detail due to its popularity, we will now briefly discuss Mesh-TensorFlow in this section.
Mesh-TensorFlow combines data and model parallelism by allowing users to configure parallelism along two dimensions, namely the batch dimension and the model dimension, as shown in the following diagram:
As shown in the preceding diagram, Mesh-TensorFlow allows users to set the parallelism level in each of two dimensions, as follows (a configuration sketch follows the list):
- Batch dimension: How many batches to train on concurrently (data parallelism)
- Model dimension: How many ways to split the model (model parallelism)
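The following is a minimal sketch of how such a two-dimensional configuration can be expressed, assuming the open-source mesh_tensorflow package (https://github.com/tensorflow/mesh). The dimension sizes, the mesh name, and the layout string mapping a tensor's "hidden" dimension onto the mesh's "model" axis are illustrative choices, not values from the text:

```python
# A minimal sketch, assuming the open-source mesh_tensorflow package.
import mesh_tensorflow as mtf

# A 2 x 2 mesh: 2-way data parallelism x 2-way model parallelism (4 devices).
mesh_shape = mtf.convert_to_shape("batch:2;model:2")

# Layout rules map tensor dimensions onto mesh dimensions: the "batch"
# tensor dimension is split across the mesh's "batch" axis (data
# parallelism), and a "hidden" tensor dimension is split across the
# mesh's "model" axis (model parallelism). The rule names here are
# illustrative assumptions.
layout_rules = mtf.convert_to_layout_rules("batch:batch;hidden:model")

# Declare a graph and a logical mesh to build the model on.
graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")
```

Tensors whose dimensions match no layout rule are simply replicated on every device, which is how Mesh-TensorFlow lets the same model code run under different parallelism configurations.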
As shown in Figure 9.13, let's assume the user sets both the batch dimension and the model dimension to 2. This means that we use two GPUs for model-parallel training, and we have two groups of this two-GPU model parallelism...
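To make this grouping concrete, here is a small plain-Python illustration (not Mesh-TensorFlow API code) of how four GPUs could be assigned to (batch, model) coordinates in a 2 x 2 mesh; the row-major assignment is an assumption for illustration:

```python
# Hypothetical illustration: grouping 4 GPUs in a 2 x 2 mesh.
BATCH_DIM, MODEL_DIM = 2, 2
gpus = list(range(BATCH_DIM * MODEL_DIM))  # GPUs 0..3

# Row-major assignment: gpu -> (batch_index, model_index).
coord = {g: divmod(g, MODEL_DIM) for g in gpus}

# GPUs sharing a batch index form one model-parallel group (different
# shards of the same replica); GPUs sharing a model index form one
# data-parallel group (the same shard across different replicas).
model_groups = [[g for g in gpus if coord[g][0] == b] for b in range(BATCH_DIM)]
data_groups = [[g for g in gpus if coord[g][1] == m] for m in range(MODEL_DIM)]

print(model_groups)  # [[0, 1], [2, 3]] -> two 2-GPU model-parallel groups
print(data_groups)   # [[0, 2], [1, 3]] -> two 2-GPU data-parallel groups
```

In other words, each pair of GPUs in a model-parallel group jointly holds one copy of the model, and the two groups train on different batches in parallel, exactly the combination of model and data parallelism that the two mesh dimensions express.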