Leveraging idle links and host resources
In the previous section, we discussed how the communication bottleneck of model synchronization can consume up to 50% of the end-to-end DNN training time. In addition, the widely used NCCL Ring All-Reduce simply abandons any of the scarce communication links that cannot be incorporated into a ring.
In this section, we will discuss how we can fully leverage all the communication links within a data parallel training environment. Then, we will discuss how to extend this approach to make use of idle links on the host (that is, CPU) side.
Tree All-Reduce
Let's continue with the fully connected four-GPU example from the previous section. As we discussed there (and as shown in Figure 4.7), the two links in the middle are left unused, which wastes scarce communication resources.
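To make the link accounting concrete, here is a minimal Python sketch (not tied to the book's code; the GPU indices and ring order are illustrative assumptions) that enumerates the six links of a fully connected four-GPU topology and subtracts the four links a ring actually traverses, leaving the two idle middle links:

```python
from itertools import combinations

gpus = [0, 1, 2, 3]  # illustrative GPU indices, not the labels in Figure 4.7

# Fully connected: every pair of GPUs shares a direct link (6 links in total).
all_links = {frozenset(pair) for pair in combinations(gpus, 2)}

# A ring only uses the links between neighboring GPUs (4 links in total).
ring_links = {frozenset((gpus[i], gpus[(i + 1) % len(gpus)]))
              for i in range(len(gpus))}

# Whatever is left over corresponds to the idle "middle" links.
idle_links = all_links - ring_links

print(f"total links: {len(all_links)}")   # 6
print(f"ring links:  {len(ring_links)}")  # 4
print(f"idle links:  {sorted(tuple(sorted(l)) for l in idle_links)}")  # [(0, 2), (1, 3)]
```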
Now, let's introduce a new All-Reduce protocol called Tree All-Reduce. It also works in two steps (a code sketch follows this list):
- First, it sends a portion of the gradients to other nodes...
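Here is a minimal, single-process Python sketch of a tree-style All-Reduce. It implements the common reduce-then-broadcast realization of the idea purely for illustration; the exact way gradient portions are routed across links in this chapter's protocol is what the steps above describe, and the function and variable names here are hypothetical:

```python
# A single-process simulation of a reduce-then-broadcast tree All-Reduce.
# This is one common realization of the tree idea, not the book's exact code.
import numpy as np

def tree_all_reduce(grads):
    """grads: list of per-GPU gradient arrays; returns the summed gradients
    that every GPU would end up holding after the All-Reduce."""
    n = len(grads)
    buf = [g.copy() for g in grads]

    # Reduce phase: at each level of the tree, a child node sends its
    # partial sum to its parent, halving the number of active nodes.
    step = 1
    while step < n:
        for child in range(step, n, 2 * step):
            parent = child - step
            buf[parent] += buf[child]        # parent accumulates the child's sum
        step *= 2

    # Broadcast phase: the root (node 0) pushes the final sum back down
    # the same tree, level by level, until every node has it.
    step //= 2
    while step >= 1:
        for child in range(step, n, 2 * step):
            parent = child - step
            buf[child] = buf[parent].copy()  # child receives the full sum
        step //= 2

    return buf

# Example: four "GPUs", each holding a different gradient vector.
grads = [np.full(4, fill_value=i, dtype=np.float32) for i in range(4)]
result = tree_all_reduce(grads)
expected = np.full(4, 6.0, dtype=np.float32)  # 0 + 1 + 2 + 3
assert all(np.array_equal(r, expected) for r in result)
```

In a real multi-GPU job, the in-memory copies in the reduce and broadcast phases would be point-to-point sends and receives over the links of the tree; with n GPUs, each phase takes on the order of log2(n) steps.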