Elasticity in model inference
Once the model is fully trained, we can use it for parallel model inference. However, traditional model inference still requires us to predefine how many workers/GPUs a serving job will use.
Here, we discuss a simple solution for elastic model serving. It works as follows (a code sketch of this policy follows the list):
- If the number of concurrent inference queries increases, we allocate more GPUs to the model-serving job.
- If the number of concurrent inference queries decreases, we shrink the number of GPUs we use.
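To make this policy concrete, the following is a minimal sketch of how such a scale-up/scale-down rule could be expressed in Python. The thresholds (QUERIES_PER_GPU, MIN_GPUS, MAX_GPUS) and the desired_gpu_count/autoscale_step helpers are illustrative assumptions, not part of any particular serving framework:

```python
# A minimal sketch of the elastic serving policy described above.
# All names and thresholds here are hypothetical placeholders.

MIN_GPUS = 1
MAX_GPUS = 4
QUERIES_PER_GPU = 1   # assumed target: one in-flight query per GPU


def desired_gpu_count(concurrent_queries: int) -> int:
    """Map the current number of concurrent queries to a GPU count."""
    needed = (concurrent_queries + QUERIES_PER_GPU - 1) // QUERIES_PER_GPU
    return max(MIN_GPUS, min(MAX_GPUS, needed))


def autoscale_step(current_gpus: int, concurrent_queries: int) -> int:
    """One autoscaling decision: grow when load rises, shrink when it falls."""
    target = desired_gpu_count(concurrent_queries)
    if target > current_gpus:
        print(f"Scaling up: {current_gpus} -> {target} GPUs")
    elif target < current_gpus:
        print(f"Scaling down: {current_gpus} -> {target} GPUs")
    return target


# Example: four concurrent queries arrive, then load drops to one query.
gpus = MIN_GPUS
gpus = autoscale_step(gpus, concurrent_queries=4)  # scales up to 4 GPUs
gpus = autoscale_step(gpus, concurrent_queries=1)  # shrinks back to 1 GPU
```

In a real deployment, the concurrent query count would come from the serving system's request queue or metrics endpoint, and the resize would be carried out by the cluster scheduler rather than a print statement.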
For example, suppose we have just received four concurrent model-serving queries, as shown in the following figure:
Figure 11.12 – Elastic model serving with more queries
As shown in the preceding figure, when more queries arrive, we can use more GPUs to serve them concurrently and thus reduce the model-serving latency.
On the contrary, if we have fewer queries, for example, only a single query, as shown in the following figure, we can shrink down the number of GPUs we use for this serving job.