Further reading
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers by Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou: https://arxiv.org/abs/2002.10957.
- LLaMA: Open and Efficient Foundation Language Models by Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al.: https://arxiv.org/abs/2302.13971.
- Building an ONNX Runtime package: https://onnxruntime.ai/docs/build/custom.html#custom-build-packages.