Summary
In this chapter, we covered a set of techniques for improving inference latency by reducing model size. We introduced the three most popular techniques, along with complete examples in TF and PyTorch: network quantization, weight sharing, and network pruning. We also described techniques that reduce model size by modifying the network architecture directly: knowledge distillation and NAS.
In the next chapter, we will explain how to deploy TF and PyTorch models on mobile devices, where the techniques described in this chapter can be especially useful.