Introducing the fundamentals of distributed training
In the previous section, we highlighted how to apply a scale-up strategy to SageMaker Training jobs by simply specifying a larger instance type. Implementing a scale-out strategy for the training process is just as straightforward. For example, we can increase the instance_count parameter for the Training job from 1 to 2, thereby instructing SageMaker to instantiate an ephemeral cluster consisting of 2 compute nodes as opposed to 1. The following code snippet highlights what the estimator variable configuration will look like:
...
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='train.py',
                    source_dir='src',
                    role...
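To make the scale-out change concrete, the following is a minimal sketch of what a complete estimator configuration with instance_count=2 might look like. The role placeholder, framework_version, py_version, and instance_type values are illustrative assumptions, not taken from the source:

```python
from sagemaker.pytorch import PyTorch

# Hedged sketch: only entry_point, source_dir, and instance_count=2
# come from the text above; every other value is an assumed example.
estimator = PyTorch(
    entry_point='train.py',            # training script
    source_dir='src',                  # directory containing train.py
    role='<your-IAM-execution-role>',  # placeholder: IAM execution role ARN
    framework_version='1.13',          # assumed PyTorch version
    py_version='py39',                 # assumed Python version
    instance_count=2,                  # scale out: 2 nodes instead of 1
    instance_type='ml.p3.2xlarge',     # assumed GPU instance type
)
```

Calling estimator.fit() on this configuration would then provision a two-node ephemeral cluster for the duration of the Training job.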