Summary
In this chapter, we looked at several ways to improve inference performance and reduce inference cost. These methods include using batch inference where possible, deploying several models behind a single inference endpoint to reduce cost and support advanced canary or blue/green deployments, scaling inference endpoints to meet demand, and using Elastic Inference and SageMaker Neo to deliver better inference performance at a lower price.
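As a quick refresher on the endpoint scaling approach, the following is a minimal sketch that uses the Application Auto Scaling API to scale a SageMaker endpoint variant based on invocations per instance. The endpoint name, variant name, and capacity and target values here are illustrative assumptions, not values taken from this chapter's examples.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint/variant names used only for illustration.
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# Register the production variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Track invocations per instance so capacity follows request volume.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```

With a policy like this in place, the endpoint adds instances as request volume grows and removes them as it falls, which is how the scaling-to-meet-demand approach summarized above keeps cost in line with traffic.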
In the next chapter, we'll discuss monitoring and other important operational aspects of ML.