Advanced Training Concepts
In this chapter, we will cover advanced concepts for training at scale, such as evaluating throughput, calculating model teraFLOPS (TFLOPS) per device, compiling, and using the scaling laws to determine how long to train. In the last chapter, you learned in general terms how to run large-scale training on SageMaker. In this chapter, you’ll learn more sophisticated techniques you can use to drive down the overall cost of your job. A lower cost per training step can translate into higher model performance, because you can train for longer on the same budget.
We will cover the following topics in this chapter:
- Evaluating and improving throughput with model TFLOPS
- Using FlashAttention to speed up your training runs
- Speeding up your jobs with compilation
- Amazon SageMaker Training Compiler and Neo
- Running compiled models on Amazon’s Trainium and Inferentia custom hardware
- Solving for an...