All lossy compression methods share a potential problem: once part of the information in the model is discarded, you have to check how the compressed network actually performs. Retraining (fine-tuning) the compressed model helps the network adapt to the new constraints and recover lost accuracy.
Network optimization techniques include:
- Weight quantization: reduce the numerical precision of computation. For example, a model trained in full precision (float32) can be converted to int8, which cuts the memory footprint by 4x and significantly speeds up inference, usually at the cost of a small accuracy drop (a sketch follows this list).
- Weight pruning: remove weights that contribute little to the output, typically those with the smallest magnitudes, leaving a sparse network (example below).
- Weight decomposition: replace a large weight tensor with a product of smaller factors.
- Low-rank approximation: the most common form of decomposition, usually via truncated SVD. A good approach for CPU inference (sketch below).
- Knowledge distillation: train a smaller "student" model to reproduce the outputs of a larger "teacher" model (a loss sketch is given below).
- Dynamic memory allocation: allocate memory for each intermediate tensor only while it is in use, so activation buffers can be reused at runtime.
- Layer and tensor fusion: combine successive layers into a single operation, which reduces the memory needed to store intermediate results (a conv + BatchNorm example is given below).
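
To make quantization concrete, below is a minimal sketch of symmetric per-tensor int8 quantization in NumPy. Real toolchains (PyTorch, TensorRT, etc.) add per-channel scales and calibration; the function names here are purely illustrative.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: w ~= scale * q, with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0                      # map the largest |w| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)         # a float32 weight matrix
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```

Storage drops from 4 bytes to 1 byte per weight, and int8 matrix multiplies run much faster on hardware with dedicated integer units.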
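For pruning, here is a sketch of the simplest criterion, magnitude pruning, which zeroes out the smallest weights; the 50% sparsity level is an arbitrary choice for illustration:

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the `sparsity` fraction of weights with the smallest magnitudes."""
    k = int(sparsity * w.numel())
    if k == 0:
        return w.clone()
    threshold = w.abs().flatten().kthvalue(k).values     # k-th smallest |w|
    return torch.where(w.abs() > threshold, w, torch.zeros_like(w))

w = torch.randn(256, 256)
w_sparse = magnitude_prune(w, sparsity=0.5)
print("fraction zeroed:", (w_sparse == 0).float().mean().item())  # ~0.5
```

Note that zeros only translate into speedups when the runtime has sparse-aware kernels, or when whole channels are removed (structured pruning).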
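For low-rank approximation, a truncated SVD splits one large matrix multiply into two much smaller ones; the rank of 64 below is an assumed value that in practice is tuned against the accuracy drop:

```python
import numpy as np

def low_rank(w: np.ndarray, rank: int):
    """Approximate W (m x n) as A (m x r) @ B (r x n) via truncated SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]                           # absorb singular values into A
    b = vt[:rank, :]
    return a, b

w = np.random.randn(512, 512).astype(np.float32)
a, b = low_rank(w, rank=64)
# 512 * 512 = 262k weights become 2 * 512 * 64 = 65k weights.
print("relative error:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))
```

A dense layer y = Wx is then computed as y = A(Bx), i.e. two cheap matrix multiplies instead of one expensive one.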
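For distillation, the classic recipe (Hinton et al., "Distilling the Knowledge in a Neural Network") mixes a softened teacher-matching loss with the usual hard-label loss; the temperature T and mixing weight alpha below are assumed hyperparameters:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend a soft-target KL term (teacher) with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),       # student log-probs at temperature T
        F.softmax(teacher_logits / T, dim=-1),           # teacher probs at temperature T
        reduction="batchmean",
    ) * (T * T)                                          # rescale gradients after dividing by T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```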
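Finally, the most common fusion: folding a BatchNorm layer into the preceding convolution. This is an inference-time rewrite (both modules in eval mode, statistics frozen); the sketch assumes a standard nn.Conv2d followed by nn.BatchNorm2d:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm statistics into the conv weights: bn(conv(x)) == fused(x)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)        # gamma / sigma
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))   # scale per output channel
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16)
conv.eval(); bn.eval()
x = torch.randn(1, 3, 32, 32)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))  # True
```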
At the moment, each of these techniques has its own pros and cons, and in practice they are often combined.