Deploying a language model with ONNX, TensorRT, and NVIDIA Triton Server
The three tools are ONNX, TensorRT, and NVIDIA Triton Server. ONNX and TensorRT are used for GPU-based inference acceleration, while NVIDIA Triton Server hosts the model behind HTTP or gRPC APIs. We will explore all three tools practically in this section. TensorRT is widely regarded as providing the strongest GPU-targeted model optimizations to speed up inference, and NVIDIA Triton Server is a battle-tested server for hosting deep learning models with native TensorRT compatibility. ONNX, on the other hand, is the intermediate format in this setup, which we will use primarily to store the model weights in a representation that TensorRT can load directly.
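To make the role of ONNX concrete, the sketch below exports a Hugging Face PyTorch model to an ONNX file that TensorRT can later build an optimized engine from. This is a minimal sketch only: the model name distilgpt2, the output path model.onnx, and the opset version are illustrative assumptions, not part of the original setup.

```python
# Minimal sketch: export a Hugging Face causal LM to ONNX so TensorRT can
# consume it later. Model name, output path, and opset are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"   # assumption: any small causal LM works the same way
onnx_path = "model.onnx"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.config.use_cache = False    # skip KV-cache outputs to keep the graph simple
model.config.return_dict = False  # return plain tuples, which the tracer handles
model.eval()

# A dummy input lets the exporter trace the computation graph.
dummy = tokenizer("Hello, TensorRT!", return_tensors="pt")

with torch.no_grad():
    torch.onnx.export(
        model,
        (dummy["input_ids"],),
        onnx_path,
        input_names=["input_ids"],
        output_names=["logits"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "logits": {0: "batch", 1: "sequence"},
        },
        opset_version=17,
    )
```

Hugging Face's Optimum project also offers a higher-level export path (for example, the optimum-cli export onnx command) if you prefer not to call torch.onnx.export directly.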
In this practical tutorial, we will deploy a language model sourced from Hugging Face that runs on most NVIDIA GPU devices. We will convert our PyTorch-based language model from Hugging Face into ONNX weights, which will allow TensorRT to load the Hugging Face...