Hands-on lab – running distributed model training with PyTorch
In this hands-on lab, you will use SageMaker Training Service to run data-parallel distributed training. We will use PyTorch's torch.nn.parallel.DistributedDataParallel (DDP) API as the distributed training framework and run the training job on a small cluster. We will reuse the dataset and training scripts from the hands-on lab in Chapter 8, Building a Data Science Environment Using AWS Services.
All right, let's get started!
Modifying the training script
First, we need to add distributed training support to the training script. To start, create a copy of the train.py file, rename the copy train-dis.py, and open it. You will need to make changes to the following three main functions. The steps below highlight the key changes needed; to run the lab, you can download the modified train-dis.py file from https://github.com/PacktPublishing/The-Machine-Learning-Solutions...
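Before walking through the individual functions, it may help to see the three kinds of changes a DDP-enabled PyTorch script typically needs: initializing the process group, wrapping the model in DistributedDataParallel, and sharding the data with a DistributedSampler. The sketch below illustrates these changes in a self-contained script; the model, dataset, hyperparameters, and environment-variable fallbacks are illustrative placeholders and are not the contents of the actual train-dis.py file from the repository.

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # In a real job, the launcher provides the rank, local rank, world size,
    # and master address; the defaults below let the sketch run as a single
    # local process for testing.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    # Change 1: initialize the process group so the training processes
    # can communicate with each other.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    device = (torch.device("cuda", local_rank)
              if torch.cuda.is_available() else torch.device("cpu"))

    # Change 2: wrap the model in DistributedDataParallel so gradients are
    # averaged across all processes during the backward pass.
    model = torch.nn.Linear(10, 1).to(device)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank] if device.type == "cuda" else None)

    # Change 3: shard the dataset with DistributedSampler so each process
    # trains on a different subset of the data.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))  # placeholder data
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        for features, target in loader:
            features, target = features.to(device), target.to(device)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(features), target)
            loss.backward()  # gradients are all-reduced across processes here
            optimizer.step()

    # Only rank 0 saves the checkpoint, to avoid concurrent writes.
    if rank == 0:
        torch.save(ddp_model.module.state_dict(), "model.pth")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

When the script is launched with multiple processes (for example, by torchrun or by the SageMaker training job configured in this lab), each process receives its own RANK, LOCAL_RANK, and WORLD_SIZE environment variables, which is what the sketch reads to decide which device to use and which data shard to train on.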