Summary
In this chapter, you learned how to find your best base model: the basics of model architecture, the most common use cases and modalities, and the general guidance to start with the smallest model you can. You learned about key trade-offs, such as simplicity versus complexity and covering many use cases versus specializing in just one, along with tactical guidance on picking a base model with good support. You then learned how to choose your pretraining loss function, including masked language modeling, causal language modeling, and the objectives common in vision models such as ViT and CoCa. Finally, we looked at the Alexa Teacher Model and learned how to use scaling laws to solve for model size, with help from a case study from Amazon Search.
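As a quick refresher on how those two language-modeling objectives differ, here is one standard formulation (the notation below is a common one, not necessarily the chapter's exact definitions): causal language modeling predicts every token from its left context, while masked language modeling predicts only a masked subset of positions from the remaining visible tokens:

```latex
\mathcal{L}_{\mathrm{CLM}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
\qquad
\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{t \in \mathcal{M}} \log p_\theta\bigl(x_t \mid x_{\setminus \mathcal{M}}\bigr)
```

Here, T is the sequence length, \mathcal{M} is the set of masked positions, and x_{\setminus \mathcal{M}} denotes the unmasked tokens.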
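If you want to replay the model-sizing arithmetic in code, the sketch below applies the widely cited Chinchilla-style relationships. Note the assumptions: the approximation C ≈ 6ND for total training FLOPs, the heuristic of roughly 20 training tokens per parameter, and the function name `chinchilla_optimal` are all illustrative here, not the exact procedure used in the Amazon Search case study.

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Solve for a compute-optimal model size under assumed scaling relationships.

    Assumes C ~ 6 * N * D (training FLOPs for N parameters and D tokens)
    and D ~ tokens_per_param * N (the Chinchilla heuristic of ~20 tokens
    per parameter). Substituting gives C ~ 6 * tokens_per_param * N**2,
    so N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Hypothetical budget of 1e23 training FLOPs
n, d = chinchilla_optimal(1e23)
print(f"~{n / 1e9:.1f}B parameters on ~{d / 1e9:.0f}B tokens")
# -> ~28.9B parameters on ~577B tokens
```

In practice, teams often fit these constants to loss curves from their own small-scale runs rather than reusing published values, which is the spirit of the case study above.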
Next up: working with accelerators!