Elasticity in model inference
After the model is fully trained, we can use it for parallel model inference. However, traditional model inference also requires us to predefine how many workers/GPUs to use for a serving job.
Here, we discuss a simple solution for elastic model serving. It works as follows (a minimal code sketch follows the list):
- If the number of concurrent inference inputs increases, we assign more GPUs to the model-serving job.
- If the number of concurrent inference inputs decreases, we shrink the number of GPUs assigned to it.
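The following is a minimal sketch of this scaling rule in Python. The function name pick_num_gpus and the queries_per_gpu threshold are illustrative assumptions, not a fixed API; the idea is simply to map the number of concurrent queries to a GPU count, capped by the GPUs available on the machine:

```python
import torch


def pick_num_gpus(num_concurrent_queries: int, queries_per_gpu: int = 1) -> int:
    """Return how many GPUs to assign to the serving job right now."""
    # Cap at the GPUs physically available; fall back to 1 if there is no GPU.
    available = max(torch.cuda.device_count(), 1)
    # Scale up with the query load (ceiling division), but never drop below 1.
    wanted = max(1, -(-num_concurrent_queries // queries_per_gpu))
    return min(wanted, available)
```

With queries_per_gpu=1, four concurrent queries ask for four GPUs, while a single query asks for just one.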
For example, suppose we have just received four concurrent model-serving queries, as shown in the following figure:
As shown in the preceding figure, when we have more queries, we can use more GPUs to serve them concurrently and thus reduce the model-serving latency.
Conversely, if we have fewer queries, for example, only one query, as shown in the following figure, we can shrink the number of GPUs assigned to this serving job and release the idle GPUs for other workloads.
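The following is a hedged sketch of how such an elastic serving step might look, reusing the pick_num_gpus helper sketched earlier: we replicate the model onto however many GPUs the rule asks for, then serve the pending queries in parallel, one replica per GPU. The model, the dummy queries, and the serve_batch helper are assumptions for illustration, not a prescribed implementation:

```python
import copy
from concurrent.futures import ThreadPoolExecutor

import torch
import torch.nn as nn


def serve_batch(model: nn.Module, queries: list[torch.Tensor]) -> list[torch.Tensor]:
    num_gpus = pick_num_gpus(len(queries))
    if torch.cuda.is_available():
        devices = [torch.device(f"cuda:{i}") for i in range(num_gpus)]
    else:
        devices = [torch.device("cpu")]  # CPU fallback so the sketch still runs
    # One replica of the model per device currently assigned to this job.
    replicas = [copy.deepcopy(model).to(d).eval() for d in devices]

    def run(idx_query):
        idx, query = idx_query
        slot = idx % len(devices)  # round-robin queries over the replicas
        with torch.no_grad():
            return replicas[slot](query.to(devices[slot])).cpu()

    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        return list(pool.map(run, enumerate(queries)))


if __name__ == "__main__":
    net = nn.Linear(16, 4)
    four_queries = [torch.randn(1, 16) for _ in range(4)]
    one_query = [torch.randn(1, 16)]
    serve_batch(net, four_queries)  # spreads over up to four GPUs
    serve_batch(net, one_query)     # shrinks back to a single GPU
```

With four concurrent queries the job spreads over up to four GPUs, and with a single query it shrinks back to one, which is exactly the elastic behavior described above.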