Summary
In this chapter, we covered some advanced concepts in training large-scale vision and language models. First, you learned how to evaluate and improve throughput by computing model TFLOPS per GPU and using it as one of several metrics to compare experimental results. You learned about FlashAttention and how its I/O-aware reworking of the quadratic attention loops speeds up the Transformer self-attention mechanism by as much as 3–5 times. You learned about compilation, both with methods built natively into PyTorch and with those managed by AWS, and about a few different types of compilation methods. You also learned how to update your hyperparameters for compilation, and in which cases compilation is expected to provide a boost (or not).
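As a quick reference, the following is a minimal sketch of the TFLOPS-per-GPU estimate, assuming the common approximation of roughly 6 FLOPs per parameter per token for a decoder-only Transformer (forward plus backward pass); the function name and example numbers are illustrative rather than the chapter's exact code.

```python
# Illustrative sketch: estimate achieved model TFLOPS per GPU.
# Assumes ~6 FLOPs per parameter per token (forward + backward),
# a common approximation for decoder-only Transformers.

def model_tflops_per_gpu(
    num_parameters: float,      # total trainable parameters
    tokens_per_second: float,   # global training throughput
    num_gpus: int,              # GPUs in the training job
    flops_per_param_per_token: float = 6.0,
) -> float:
    total_flops_per_second = (
        flops_per_param_per_token * num_parameters * tokens_per_second
    )
    return total_flops_per_second / num_gpus / 1e12

# Example: a 13B-parameter model processing 100k tokens/s on 64 GPUs
# achieves roughly 122 TFLOPS per GPU.
print(model_tflops_per_gpu(13e9, 100_000, 64))
```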
You also learned how to use compilers to run on Amazon’s custom machine learning hardware, Trainium and Inferentia. Lastly, we used scaling laws to solve for an optimal training time.
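As an illustrative recap of that scaling-law calculation, the sketch below estimates training time from model size, a tokens-per-parameter ratio, and sustained cluster throughput. The ~20 tokens-per-parameter ratio (the widely cited Chinchilla heuristic) and the 6 * N * D total-FLOPs estimate are assumptions for this example and may differ from the chapter's exact figures.

```python
# Illustrative sketch: estimate training time from scaling-law heuristics.
# Assumes total training FLOPs ~ 6 * N * D and a Chinchilla-style
# ~20 tokens per parameter; both are approximations, not prescriptions.

def estimated_train_days(
    num_parameters: float,        # model size N
    tokens_per_parameter: float,  # e.g., ~20 (Chinchilla heuristic)
    num_gpus: int,
    tflops_per_gpu: float,        # sustained (not peak) TFLOPS per GPU
) -> float:
    tokens = tokens_per_parameter * num_parameters     # dataset size D
    total_flops = 6.0 * num_parameters * tokens        # ~6 * N * D
    cluster_flops_per_sec = num_gpus * tflops_per_gpu * 1e12
    return total_flops / cluster_flops_per_sec / 86_400  # seconds -> days

# Example: a 13B model on 256 GPUs sustaining 120 TFLOPS each
# trains in roughly 7.6 days under these assumptions.
print(estimated_train_days(13e9, 20, 256, 120))
```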
In the next chapter, you'll...