Chapter 6: Training and Tuning at Scale
Machine learning (ML) practitioners face multiple challenges when training and tuning models at scale. These challenges come in the form of high volumes of training data, larger model sizes, and more complex model architectures. Further challenges arise from running large numbers of tuning jobs to identify the right set of hyperparameters, and from keeping track of the many experiments conducted with varying algorithms for a specific ML objective. Together, these scale challenges lead to long training times, resource constraints, and increased costs, which can reduce team productivity and create bottlenecks for ML projects.
Amazon SageMaker provides managed distributed training and tuning capabilities to improve training efficiency, along with capabilities to organize and track ML experiments at scale. SageMaker supports techniques such as streaming data directly into algorithms with Pipe mode for training on data at scale, and Managed Spot Training...
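As a rough illustration of how the capabilities above come together, the configuration sketch below sets up a SageMaker training job that streams input data with Pipe mode and runs on Managed Spot instances, using the SageMaker Python SDK (v2). The image URI, IAM role ARN, and S3 paths are placeholders, not values from this chapter; this is a sketch, not a runnable end-to-end example, since launching the job requires an AWS account and permissions.

```python
# Configuration sketch (assumes the SageMaker Python SDK v2 is installed
# and valid AWS credentials are available). All identifiers in angle
# brackets are placeholders.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",      # placeholder ECR training image
    role="<sagemaker-execution-role-arn>", # placeholder IAM role ARN
    instance_count=2,                      # distribute training across 2 instances
    instance_type="ml.p3.2xlarge",
    input_mode="Pipe",        # stream data from S3 instead of downloading it first
    use_spot_instances=True,  # enable Managed Spot Training to reduce cost
    max_run=3600,             # max training time, in seconds
    max_wait=7200,            # max time to wait for Spot capacity (must be >= max_run)
    output_path="s3://<bucket>/model-artifacts/",  # placeholder output location
)

# Launching the job (not executed here) would look like:
# estimator.fit({"train": "s3://<bucket>/train/"})
```

With `use_spot_instances=True`, SageMaker runs the job on spare EC2 capacity at a discount and checkpoints can be used to resume if an instance is reclaimed; `input_mode="Pipe"` avoids downloading the full dataset to local disk before training starts, which matters most for large datasets.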