Summary
In this chapter, we described the two most popular AWS services for deploying a DL model as an inference endpoint: EKS and SageMaker. For both options, we started with the simplest setup: creating an inference endpoint from a TF, PyTorch, or ONNX model. We then explained how to improve the performance of an inference endpoint using the Elastic Inference (EI) accelerator, AWS Neuron, and SageMaker Neo. We also covered how to set up autoscaling to handle changes in traffic more effectively. Finally, we discussed SageMaker's multi-model endpoint (MME) feature, which hosts multiple models on a single inference endpoint.
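As a quick reminder of the autoscaling setup covered in this chapter, the following is a minimal sketch of attaching a target-tracking scaling policy to a SageMaker endpoint variant using boto3. The endpoint name, variant name, capacity limits, and target value are illustrative placeholders, not values from the chapter.

```python
import boto3

# Hypothetical endpoint and variant names used for illustration only
endpoint_name = "my-endpoint"
variant_name = "AllTraffic"
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

autoscaling = boto3.client("application-autoscaling")

# Register the endpoint variant as a scalable target (1 to 4 instances here)
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Track invocations per instance so the variant scales with traffic
autoscaling.put_scaling_policy(
    PolicyName="InvocationsScalingPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # target invocations per instance (assumed value)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```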
In the next chapter, we will look at various model compression techniques: network quantization, weight sharing, network pruning, knowledge distillation, and network architecture search. These techniques can improve inference efficiency even further.