Achieving low-latency model inference
As ML models continue to grow in size and get deployed to diverse hardware devices, latency can become a bottleneck for inference use cases that demand low latency and high throughput, such as real-time fraud detection.
To reduce end-to-end inference latency for a real-time application, we can apply several optimization techniques, including model optimization, graph optimization, hardware acceleration, and inference engine optimization.
In this section, we will focus on model optimization, graph optimization, and hardware acceleration. Before we get into these topics, let’s first understand how model inference works, specifically for DL models, since most inference optimization techniques target them.
How model inference works and opportunities for optimization
As we discussed earlier in this book, DL models are constructed as computational graphs...
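To make the idea of a computational graph concrete, here is a minimal sketch (hypothetical, not the book's code): a model's forward pass expressed as a small graph of primitive ops, evaluated by walking the graph. The `Node` class and op names are illustrative assumptions.

```python
class Node:
    """One node in a toy computational graph."""
    def __init__(self, op, inputs):
        self.op = op          # operation name, e.g. "mul", "add", "relu"
        self.inputs = inputs  # upstream Node objects, or an input name

def evaluate(node, feed):
    """Recursively evaluate a node given input values in `feed`."""
    if node.op == "input":
        return feed[node.inputs[0]]
    args = [evaluate(n, feed) for n in node.inputs]
    if node.op == "add":
        return args[0] + args[1]
    if node.op == "mul":
        return args[0] * args[1]
    if node.op == "relu":
        return max(args[0], 0.0)
    raise ValueError(f"unknown op: {node.op}")

# y = relu(w * x + b): a one-neuron "model" written as a graph
x = Node("input", ["x"])
wx = Node("mul", [Node("input", ["w"]), x])
y = Node("relu", [Node("add", [wx, Node("input", ["b"])])])

print(evaluate(y, {"x": 2.0, "w": 3.0, "b": -1.0}))  # → 5.0
```

Representing inference this way is what makes graph-level optimizations possible: an optimizer can rewrite the graph (for example, fusing the `mul` and `add` into one fused op) before any input ever arrives.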