Single-machine multi-GPU and multi-machine multi-GPU
So far, we have discussed the main steps in data parallel training. In this section, we will explain two main types of hardware settings in data parallel training:
- The first type is a single machine with multiple GPUs. In this setting, all the parallel training workers can be launched from either a single process or multiple processes.
- The second type is multiple machines, each with multiple GPUs. In this setting, we need to configure the network communication endpoints among all the machines. We also need to form a process group that synchronizes training both across machines and across GPUs.
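To make the process-group indexing concrete: each training process is usually identified by a global rank derived from its machine index and its local GPU index. The helper names below are hypothetical illustrations, not part of any library API:

```python
def global_rank(node_rank, local_rank, gpus_per_node):
    # A process's unique index within the whole process group:
    # processes on machine 0 get ranks 0..gpus_per_node-1,
    # machine 1 gets the next block, and so on.
    return node_rank * gpus_per_node + local_rank

def local_device(rank, gpus_per_node):
    # Which GPU on its own machine a process with the given
    # global rank should bind to.
    return rank % gpus_per_node
```

For example, with 4 GPUs per machine, local GPU 2 on machine 1 gets global rank 6, and that rank maps back to device 2 on its node.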
Single-machine multi-GPU
Compared to the multi-machine setting, the single-machine multi-GPU setting is easier to set up. Before we discuss the implementation, let's verify that the hardware configuration is ready. Type the following command in the terminal:
$ nvidia-smi
If the NVIDIA driver and CUDA...
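A minimal programmatic version of this check, assuming only the Python standard library, is sketched below. It merely looks for the nvidia-smi binary on the PATH rather than querying the driver directly, so it indicates that an NVIDIA driver installation is likely present, not that the GPUs are usable:

```python
import shutil

def nvidia_smi_available():
    # True if the nvidia-smi utility is found on PATH, which
    # suggests an NVIDIA driver is installed on this machine.
    return shutil.which("nvidia-smi") is not None
```

In a PyTorch program, the more direct checks are `torch.cuda.is_available()` and `torch.cuda.device_count()`.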