Exploring memory and storage resources
In this section, we will discuss another way to improve system throughput during model-parallel training and inference.
One major limitation of GPU-based DNN training is the size of on-device memory. Here, we will effectively extend the memory available for GPU training by leveraging other storage within the system, such as CPU memory and disk.
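The core idea, spilling data from a small, fast tier to larger, slower tiers when the fast tier fills up, can be sketched with a hypothetical three-tier store. This is an illustrative model only (the class name, slot counts, and eviction policy are assumptions, and plain Python objects stand in for GPU tensors):

```python
import pickle
import tempfile
from collections import OrderedDict

class TieredStore:
    """Hypothetical three-tier store: a small 'device' tier spills to a
    larger 'host' tier, which in turn spills to disk (pickle files)."""

    def __init__(self, device_slots=2, host_slots=4):
        self.device = OrderedDict()   # models scarce GPU memory
        self.host = OrderedDict()     # models larger CPU memory
        self.disk = {}                # key -> path of a temp file on disk
        self.device_slots = device_slots
        self.host_slots = host_slots

    def put(self, key, tensor):
        self.device[key] = tensor
        self._spill()

    def _spill(self):
        # Evict the oldest entries downward through the tiers (FIFO policy).
        while len(self.device) > self.device_slots:
            k, v = self.device.popitem(last=False)
            self.host[k] = v
        while len(self.host) > self.host_slots:
            k, v = self.host.popitem(last=False)
            f = tempfile.NamedTemporaryFile(delete=False)
            pickle.dump(v, f)
            f.close()
            self.disk[k] = f.name

    def get(self, key):
        # Look in the fastest tier first, then fall back to slower tiers.
        if key in self.device:
            return self.device[key]
        if key in self.host:
            return self.host[key]
        with open(self.disk[key], "rb") as f:
            return pickle.load(f)
```

Real systems such as PyTorch's CPU-offloading features follow the same pattern, but move actual tensors over PCIe/NVLink instead of pickling Python objects.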
Before we jump into our techniques, let's look at how the CPU, GPU, and disk are interconnected, as shown in the following diagram:
As shown in the preceding diagram, on state-of-the-art (SOTA) machines such as the NVIDIA DGX-1 and DGX-2, the storage capacities are roughly as follows:
- The per-GPU memory is usually around 40 gigabytes (GB).
- The CPU memory (main memory) is in the hundreds of GB (for example, 100 GB-200 GB).
- The disk storage is in the tens of terabytes (TB).
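You can inspect the host-side tiers of this hierarchy on your own machine with the standard library. A minimal sketch (GPU memory would normally come from a vendor API such as `torch.cuda.get_device_properties`, which is omitted here so the snippet runs anywhere; `os.sysconf` with these keys assumes a POSIX system):

```python
import os
import shutil

# Total physical CPU (main) memory, from POSIX sysconf values.
page_size = os.sysconf("SC_PAGE_SIZE")     # bytes per memory page
phys_pages = os.sysconf("SC_PHYS_PAGES")   # number of physical pages
cpu_mem_gb = page_size * phys_pages / 1e9

# Total disk capacity of the root filesystem.
disk = shutil.disk_usage("/")
disk_tb = disk.total / 1e12

print(f"CPU memory:    {cpu_mem_gb:.1f} GB")
print(f"Disk capacity: {disk_tb:.2f} TB")
```

On a DGX-class machine, you would expect the first number to be in the hundreds of GB and the second in the tens of TB, matching the hierarchy above.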
On the connection side, both the GPU...