Inference is the process of using a trained deep learning model to compute predictions. Inference speed is measured in images per second or, equivalently, seconds per image. A model generally needs to process between 5 and 30 images per second to be considered capable of real-time processing. Before we can improve inference speed, we need to measure it properly.
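As a rough illustration of such a measurement, the sketch below times a model on a batch of images and reports both metrics. ResNet-18 and random tensors are stand-ins here, not part of the original setup; any model and dataset could be substituted.

```python
import time

import torch
import torchvision

# Hypothetical setup: ResNet-18 and random tensors stand in for a real model and dataset.
model = torchvision.models.resnet18(weights=None).eval()
images = torch.randn(100, 3, 224, 224)  # 100 dummy 224x224 RGB images

with torch.no_grad():
    model(images[:1])  # warm-up run so one-time setup costs are not timed

    start = time.perf_counter()
    for image in images:
        model(image.unsqueeze(0))  # process one image at a time
    elapsed = time.perf_counter() - start

print(f"{len(images) / elapsed:.1f} images per second")
print(f"{elapsed / len(images):.3f} seconds per image")
```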
If a model can process i images per second, we can always run N inference pipelines in parallel to boost throughput: together, they can process N × i images per second. While this kind of parallelism benefits many applications, it does not help real-time applications.
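To illustrate the throughput scaling, here is a minimal sketch that spreads a set of images across N worker processes. The `fake_predict` function is a placeholder, not a real model; it simply sleeps for about 10 ms to mimic per-image inference time, so the aggregate throughput should scale roughly with the number of workers.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def fake_predict(image):
    # Placeholder for running a model on a single image (~10 ms per image).
    time.sleep(0.01)
    return 0


def run_pipeline(image_batch):
    # One inference pipeline: processes its share of the images sequentially.
    return [fake_predict(image) for image in image_batch]


if __name__ == "__main__":
    images = list(range(400))
    n_pipelines = 4
    shards = [images[i::n_pipelines] for i in range(n_pipelines)]

    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=n_pipelines) as pool:
        list(pool.map(run_pipeline, shards))
    elapsed = time.perf_counter() - start

    print(f"Aggregate throughput: {len(images) / elapsed:.1f} images per second")
```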
In a real-time context, such as a self-driving car, it does not matter how many images can be processed in parallel; what matters is latency, the time it takes to compute predictions for a single image. Therefore, for real-time applications we measure only a model's latency.
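One way to measure this, reusing the same hypothetical model and dummy input as before, is to time single-image predictions repeatedly and look at the resulting latencies rather than a single averaged figure:

```python
import statistics
import time

import torch
import torchvision

# Hypothetical setup mirroring the earlier sketch.
model = torchvision.models.resnet18(weights=None).eval()
image = torch.randn(1, 3, 224, 224)  # a single image, batch size 1

latencies = []
with torch.no_grad():
    model(image)  # warm-up
    for _ in range(200):
        start = time.perf_counter()
        model(image)
        latencies.append(time.perf_counter() - start)

latencies.sort()
print(f"Median latency: {statistics.median(latencies) * 1000:.1f} ms")
print(f"Worst of 200 runs: {latencies[-1] * 1000:.1f} ms")
```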
For non-real-time applications, you can run...