Hyperparameter tuning in model parallelism
In this section, we will discuss some of the important hyperparameters in the model parallel training process, such as how to balance the workload among GPUs and whether to enable pipeline parallelism.
Balancing the workload among GPUs
In most cases, we split the model layer-wise. Since we use homogeneous GPUs, we should try to balance the workload evenly among all the GPUs we have.
GPU workload is not always linearly proportional to the number of layers a GPU holds. One way to balance the workload among GPUs is to look at each GPU's computation core utilization, which is reported by nvidia-smi as Volatile GPU-Util. For example, the following screenshot shows that GPU0 has a greater workload than GPU1: Volatile GPU-Util on GPU0 is 42%, whereas on GPU1, it is only 20%:
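To monitor this imbalance programmatically rather than reading the nvidia-smi table by eye, you can query utilization in CSV form with `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader` and parse the result. The following is a minimal sketch; the `sample_output` string is illustrative, mirroring the 42%/20% imbalance described above, and the helper name `parse_gpu_util` is our own:

```python
# Illustrative sample of what the command
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader
# prints: one "index, util %" line per GPU.
sample_output = "0, 42 %\n1, 20 %\n"

def parse_gpu_util(csv_text):
    """Return a dict mapping GPU index -> compute utilization (percent)."""
    util = {}
    for line in csv_text.strip().splitlines():
        index, percent = line.split(",")
        util[int(index)] = int(percent.strip().rstrip("%").strip())
    return util

util = parse_gpu_util(sample_output)
print(util)  # {0: 42, 1: 20}
```

In practice, you would obtain `csv_text` by running the nvidia-smi command above (for example, via `subprocess.run`) while training is in progress, since utilization is only meaningful under load.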
Thus, we need to move some of the layers originally assigned to GPU0 over to GPU1...
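One way to decide how many layers each GPU should hold is to assign each layer an estimated cost (for example, its parameter count or a measured forward-pass time) and split the layer list into contiguous groups of roughly equal total cost. The sketch below is a simple greedy heuristic under that assumption; the function name `split_layers` and the cost values are hypothetical, not part of any library:

```python
def split_layers(layer_costs, num_gpus):
    """Split layers into contiguous groups, one per GPU, so that each
    group's total estimated cost is close to the overall average.

    Groups stay contiguous because, in a layer-wise model parallel
    split, each GPU must hold a consecutive slice of the model.
    """
    target = sum(layer_costs) / num_gpus
    groups, current, current_cost = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        current.append(i)
        current_cost += cost
        remaining_gpus = num_gpus - len(groups) - 1
        remaining_layers = len(layer_costs) - i - 1
        # Close this group once it reaches the target cost, but leave
        # at least one layer for every remaining GPU.
        if current_cost >= target and remaining_layers >= remaining_gpus > 0:
            groups.append(current)
            current, current_cost = [], 0.0
    groups.append(current)  # last GPU takes the remaining layers
    return groups

# Six layers: four cheap ones followed by two expensive ones.
print(split_layers([1, 1, 1, 1, 4, 4], 2))  # [[0, 1, 2, 3, 4], [5]]
```

Compared with splitting purely by layer count (three layers per GPU here), this cost-aware split moves work off the GPU that would otherwise show the higher Volatile GPU-Util reading.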