Optimizing your data pipeline on Amazon SageMaker
Remember that we’ve learned about ephemeral training on Amazon SageMaker, where you can seamlessly spin up anywhere from a few GPUs to hundreds or thousands on fully managed remote instances. Now, let’s learn about the different options for optimizing how data is sent to your SageMaker Training instances.
If you’ve worked with SageMaker Training, you’ll remember the different stages your job moves through: starting the instances, downloading your data, downloading and invoking your training image, and finally uploading the finished model.
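If you want to see how long each of these stages took for a given job, you can read them back from the job description. Here's a minimal sketch using boto3; the job name is a placeholder:

```python
# A small sketch of inspecting the lifecycle stages of a training job
# with boto3. "my-training-job" is a placeholder -- use your own job name.
import boto3

sm = boto3.client("sagemaker")
desc = sm.describe_training_job(TrainingJobName="my-training-job")

# Each transition records when the job entered a stage such as
# Starting, Downloading, Training, or Uploading.
for t in desc["SecondaryStatusTransitions"]:
    start = t["StartTime"]
    end = t.get("EndTime", start)  # the current stage may not have ended yet
    print(f"{t['Status']:>12}: {(end - start).total_seconds():.0f}s -- {t['StatusMessage']}")
```

Watching the Downloading stage in particular is a quick way to tell whether your data pipeline, rather than your training loop, is the bottleneck.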
Here’s a screenshot from my 2022 re:Invent demo, featuring Stable Diffusion. You might ask yourself: how am I downloading 50 million image/text pairs in only two minutes? The answer is an optimized data pipeline. In this case, I used Amazon FSx for Lustre; I’ll show you how to wire that up right after the figure.
Figure 6.11 – Training job status
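Here's a minimal sketch of pointing a training job at an existing FSx for Lustre file system with the SageMaker Python SDK. The file system ID, directory path, networking values, role, and image URI below are placeholders, not values from my demo; FSx also requires the job to run inside your VPC:

```python
# A minimal sketch of wiring an FSx for Lustre file system into a
# SageMaker Training job. All IDs, paths, and ARNs are placeholders --
# replace them with your own resources.
from sagemaker.estimator import Estimator
from sagemaker.inputs import FileSystemInput

# Point the training channel at the Lustre file system instead of S3.
# The job mounts the file system directly, so there is no bulk dataset
# download at startup -- this is what keeps the Downloading stage short.
train_fs = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",        # your FSx for Lustre ID
    file_system_type="FSxLustre",
    directory_path="/fsx/my-dataset",             # mount name + path on the file system
    file_system_access_mode="ro",                 # read-only is enough for training
)

estimator = Estimator(
    image_uri="<your-training-image-uri>",
    role="<your-execution-role-arn>",
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    # FSx access requires VPC networking on the training job.
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
)

estimator.fit({"training": train_fs})
```

Because the file system persists between jobs, you pay the cost of loading your dataset onto Lustre once, and every subsequent training job mounts it in seconds.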
For much smaller datasets...