Model servers and end-to-end hosting optimizations
You might be wondering: if SageMaker is hosting my model artifact and my inference script, how does that become a real-time service that can respond to live traffic? The answer is model servers! If you aren’t particularly interested in learning how to wrap your model’s inference responses in a RESTful interface, you’ll be happy to know that SageMaker largely abstracts this away for easy, fast prototyping. However, if you’d like to optimize your inference stack to deliver state-of-the-art response times, read on.
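To make this concrete, here is a minimal sketch of what a model server has to satisfy under SageMaker's bring-your-own-container contract: answer GET /ping health checks and POST /invocations inference requests on port 8080. The `load_model` helper and the JSON payload shape are hypothetical placeholders for your own model logic, and the raw Flask app is purely illustrative; real containers typically rely on a purpose-built model server such as TorchServe or Multi Model Server.

```python
# Minimal sketch of SageMaker's container contract using Flask.
# `load_model` and the JSON payload shape are hypothetical placeholders.
import json

from flask import Flask, Response, request

app = Flask(__name__)
model = None  # loaded lazily so /ping can succeed before the first request


def load_model():
    """Hypothetical loader: read your artifact from /opt/ml/model."""
    global model
    if model is None:
        model = lambda x: x  # stand-in for your real deserialized model
    return model


@app.route("/ping", methods=["GET"])
def ping():
    # SageMaker polls this health check; a 200 means the container is ready.
    return Response(status=200)


@app.route("/invocations", methods=["POST"])
def invoke():
    # SageMaker forwards each InvokeEndpoint request body to this route.
    payload = json.loads(request.data)
    prediction = load_model()(payload["inputs"])
    return Response(json.dumps({"outputs": prediction}),
                    mimetype="application/json")


if __name__ == "__main__":
    # SageMaker routes endpoint traffic to port 8080 inside the container.
    app.run(host="0.0.0.0", port=8080)
```

This is why the default experience feels so hands-off: SageMaker's prebuilt framework containers ship with a model server already wired up to this contract, so you only supply the model artifact and inference script.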
There are five key types of latency to trim as you improve your model hosting response times. Here’s how we can summarize them:
- Container latency: This refers to the time overhead of entering and exiting each of your containers. As we learned earlier, on SageMaker you might host a variety of containers in a serial inference pipeline. This is pictured...