Exploring two-phase model serving techniques
Two-phase model serving can take one of the following three forms, depending on how the phase one model is obtained from, or related to, the phase two model:
- Quantized phase one model
- Separately trained phase one model with reduced features
- Separately trained, entirely different phase one and phase two models
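Whichever of the three variants is used, the serving logic is the same: the lightweight phase one model answers easy inputs, and only low-confidence inputs are escalated to the heavier phase two model. The following is a minimal sketch of that routing, assuming a confidence threshold of 0.8; both "models" and the threshold are illustrative stand-ins, not part of any particular framework.

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off for trusting phase one

def phase_one_predict(x):
    """Cheap stand-in model: confidence grows as x moves away from 0.5."""
    score = 1.0 if x >= 0.5 else 0.0
    confidence = abs(x - 0.5) * 2
    return score, confidence

def phase_two_predict(x):
    """Expensive stand-in for the full-strength server-side model."""
    return 1.0 if x >= 0.5 else 0.0

def serve(x):
    score, confidence = phase_one_predict(x)
    if confidence >= CONFIDENCE_THRESHOLD:
        return score, "phase_one"
    # Escalate uncertain inputs to the stronger phase two model.
    return phase_two_predict(x), "phase_two"

print(serve(0.95))  # confident: handled by phase one
print(serve(0.55))  # uncertain: escalated to phase two
```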
Quantized phase one model
With this type, we first develop the phase two model to be deployed on the server. Then, we apply integer quantization to the phase two model to obtain the phase one model. Integer quantization is an optimization technique that converts floating point numbers to 8-bit integers, which can shrink the model size severalfold.
For example, converting 64-bit floating point numbers to 8-bit integers yields up to an 8-times reduction (64/8 = 8); for the more common 32-bit floats, the reduction is up to 4 times. A basic example of reducing the size of a floating point NumPy array to a uint8 NumPy array is shown in the following code block:
```python
import numpy as np
import sys

# A float64 array: 8 bytes per element.
X = np.random.rand(1000)
print(sys.getsizeof(X))

# Scale values from [0, 1) into [0, 255] and cast to 8-bit unsigned
# integers: 1 byte per element, roughly an 8-times smaller payload.
X_uint8 = (X * 255).astype(np.uint8)
print(sys.getsizeof(X_uint8))
```
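Practical quantization schemes do more than rescale into [0, 255]: they map a float range to integers using a scale and a zero point, so the original values can be approximately recovered at inference time. The snippet below is an illustrative NumPy sketch of that affine mapping, not the exact implementation used by any particular framework.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Affine-quantize a float array to unsigned integers.

    Returns the integer array plus the scale and zero point
    needed to map values back to the original float range.
    """
    qmin, qmax = 0, 2**num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = qmin - x.min() / scale
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    """Approximately recover the original float values."""
    return (q.astype(np.float64) - zero_point) * scale

x = np.array([-1.0, 0.0, 0.5, 1.0])
q, scale, zp = quantize(x)
x_hat = dequantize(q, scale, zp)
# The reconstruction error is at most half a quantization step (scale / 2).
print(np.max(np.abs(x - x_hat)))
```

The phase one model produced this way is smaller and faster but slightly less accurate, which is exactly the trade-off two-phase serving exploits.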