Job migration and multiplexing
This section covers DNN training job migration and multiplexing. We begin with the motivation for job migration and the operations it involves.
Job migration
We start with why job migration is needed. The following figure shows a simple example:
As shown in the preceding figure, in a cloud environment, a single DNN training job can end up split across multiple machines. As per one of our assumptions at the beginning of this chapter, cross-machine communication bandwidth is low. Therefore, if we conduct frequent model synchronization between GPU 1 and GPU 3, which sit on different machines, every synchronization incurs high network communication latency. The GPUs then spend much of their time waiting on communication rather than computing, so system utilization is very low.
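To make the cost of a split placement concrete, here is a minimal sketch in Python. The GPU-to-machine layout, the bandwidth figures, the payload size, and the helper names `machines_spanned` and `sync_time_seconds` are all assumptions chosen for illustration; they are not taken from the figure or from any real scheduler. The sketch only compares a rough per-synchronization transfer time for a job split across machines against the same job placed on one machine.

```python
# Hypothetical placement: which machine hosts each GPU (assumed layout).
gpu_to_machine = {1: "machine_a", 2: "machine_a", 3: "machine_b", 4: "machine_b"}

# Illustrative bandwidths in GB/s (assumed numbers, not measurements).
INTRA_MACHINE_GB_PER_S = 100.0
CROSS_MACHINE_GB_PER_S = 1.0


def machines_spanned(job_gpus, placement):
    """Return the set of machines hosting at least one GPU of the job."""
    return {placement[gpu] for gpu in job_gpus}


def sync_time_seconds(job_gpus, placement, payload_gb):
    """Rough time to move one synchronization payload over the slowest link."""
    crosses_machines = len(machines_spanned(job_gpus, placement)) > 1
    bandwidth = CROSS_MACHINE_GB_PER_S if crosses_machines else INTRA_MACHINE_GB_PER_S
    return payload_gb / bandwidth


split_job = [1, 3]      # GPUs on two different machines
colocated_job = [1, 2]  # same job size, both GPUs on one machine
payload_gb = 0.5        # assumed size of one model synchronization

print("split job spans:", sorted(machines_spanned(split_job, gpu_to_machine)))
print("split sync time (s):", sync_time_seconds(split_job, gpu_to_machine, payload_gb))
print("colocated sync time (s):", sync_time_seconds(colocated_job, gpu_to_machine, payload_gb))
```

Under these assumed numbers, the split placement pays the slow cross-machine link on every synchronization, while the colocated placement does not. That gap is what job migration is meant to close.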
Due to the low system efficiency, we want to move GPUs working on the same job into the minimum number...