Hands-on lab – running distributed model training with PyTorch
As an ML solutions architect, you need to explore and design different model-training paradigms to meet a range of training requirements. In this hands-on lab, you will use the SageMaker Training service to run data-parallel distributed training. We will use PyTorch's torch.nn.parallel.DistributedDataParallel API as the distributed training framework and run the training job on a small cluster. We will reuse the dataset and training scripts from the hands-on lab in Chapter 8, Building a Data Science Environment Using AWS ML Services.
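To make the data-parallel setup concrete before we adapt the Chapter 8 scripts, the following is a minimal sketch of the structure a DistributedDataParallel training script typically follows. It assumes the launcher (torchrun, or the SageMaker PyTorch container when distribution is enabled) sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables; the linear model and synthetic dataset are placeholders for illustration only, not the financial sentiment model from Chapter 8.

# Minimal DistributedDataParallel sketch (placeholder model and data).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # Initialize the process group; NCCL is the usual backend for GPU training.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model wrapped in DDP so gradients are synchronized across processes.
    model = torch.nn.Linear(10, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Synthetic dataset; DistributedSampler gives each process a distinct data shard.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for features, labels in loader:
            features = features.cuda(local_rank)
            labels = labels.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()  # gradients are all-reduced across processes here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Each process runs this same script on its own GPU; DDP averages the gradients during the backward pass, so every replica applies the same parameter update. We will follow this same pattern when adapting the Chapter 8 training script for the SageMaker Training service.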
Problem statement
In Chapter 8, we trained a financial sentiment model in the data science environment you built with SageMaker. The model was trained on a single GPU, both in the Studio notebook and with the SageMaker Training service. Anticipating future needs to train models on large datasets, we need to design an ML training process that uses multiple GPUs to scale...