Hosting distributed models on SageMaker
In Chapter 5, we covered distribution fundamentals, where you learned how to think about splitting your model and datasets across multiple GPUs. The good news is that you can apply this same logic to hosting the model. In this case, you'll be more interested in model parallelism, placing layers and tensors across multiple GPU partitions. You won't actually need a data parallel framework, because we're not using backpropagation. We're only running a forward pass through the network and getting inference results. There's no gradient descent or weight updating involved.
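To make this concrete, here is a minimal sketch of model-parallel inference using the Hugging Face Transformers and Accelerate libraries rather than any SageMaker-specific API; the checkpoint name and prompt are placeholders for illustration. Passing device_map="auto" asks Accelerate to partition the model's layers across all visible GPUs, so no single device has to hold the full set of weights, and the forward pass flows through the partitions automatically.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6b"  # example checkpoint; swap in your own

tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" spreads the model's layers across every visible GPU,
# so no single device needs to fit all of the weights.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,  # half precision roughly halves the memory footprint
)

# Inference is only a forward pass -- no gradients, no optimizer state.
prompt = "Distributed hosting lets us serve"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The same layer-placement idea carries over to SageMaker hosting: the serving container partitions the model across the GPUs on the endpoint instance and only ever runs forward passes.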
When would you use distributed model hosting? To integrate extremely large models into your applications! Generally, this is scoped to large language models; it's rare to see vision models stretch beyond a single GPU. Remember, in Chapter 4, Containers and Accelerators on the Cloud, we learned about different sizes of GPU memory. This is just as...