Inference time depends on the number of floating-point operations (FLOPs) a model requires and on how many operations per second (FLOPS) the hardware can execute. A model's FLOPs count is influenced by its number of parameters and by how those parameters are used; the operations are mostly matrix operations, such as additions and multiplications. For example, a convolution layer has only a few parameters, representing the kernel, but takes longer to compute because the kernel has to be applied at every position of the input matrix. A fully connected layer, in contrast, has a huge number of parameters, yet each weight is used only once per forward pass, so it runs quickly, as the sketch below shows.
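To make the contrast concrete, here is a minimal sketch in plain Python that counts parameters and multiply-accumulate operations (MACs) for both layer types; the function names and layer shapes (roughly ResNet- and VGG-sized) are illustrative assumptions, not taken from any particular model.

```python
def conv2d_cost(in_ch, out_ch, kernel, out_h, out_w):
    """Parameters and MACs for a 2D convolution (bias ignored)."""
    params = in_ch * out_ch * kernel * kernel
    # The same kernel weights are reapplied at every output position.
    macs = params * out_h * out_w
    return params, macs

def linear_cost(in_features, out_features):
    """Parameters and MACs for a fully connected layer (bias ignored)."""
    params = in_features * out_features
    # Each weight participates in exactly one multiply per forward pass.
    macs = params
    return params, macs

# 3x3 conv, 64 -> 64 channels, on a 56x56 feature map.
print(conv2d_cost(64, 64, 3, 56, 56))  # (36864, 115605504): few params, many MACs
# Fully connected layer, 4096 -> 4096.
print(linear_cost(4096, 4096))         # (16777216, 16777216): many params, few MACs
```

The convolution here has roughly 450 times fewer parameters than the fully connected layer, yet performs about 7 times more multiply-accumulates, which is exactly the asymmetry described above.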
The weights of a model are usually stored as high-precision floating-point values (typically 32-bit single precision, sometimes 64-bit double precision), and an arithmetic operation on such numbers is more expensive than the same operation on quantized values. In the next section, we will illustrate how quantizing the weights affects the model's performance...
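As a quick preview, the following is a minimal sketch of affine (asymmetric) int8 quantization using NumPy; the weight shape, value distribution, and helper names are assumptions chosen for illustration, not a specific library's API.

```python
import numpy as np

def quantize_int8(w):
    """Map float weights onto the int8 range [-128, 127]."""
    scale = (w.max() - w.min()) / 255.0
    # Choose the zero point so that w.min() lands exactly on -128.
    zero_point = np.round(-128 - w.min() / scale)
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the int8 representation."""
    return scale * (q.astype(np.float32) - zero_point)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(256, 256)).astype(np.float32)  # hypothetical layer

q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)

print("storage: %d -> %d bytes" % (w.nbytes, q.nbytes))        # 4x smaller
print("max abs quantization error: %.6f" % np.abs(w - w_hat).max())
```

Because the scale is derived from the tensor's own minimum and maximum, the full int8 range is used; the price is a small rounding error bounded by the step size, which is the trade-off the next section examines.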