Hosting real-time endpoints
SageMaker real-time inference is a fully managed feature that hosts your model(s) on compute instances for low-latency, real-time inference. The deployment process consists of the following steps:
- Create a model, container, and associated inference code in SageMaker. The model refers to the training artifact, `model.tar.gz`. The container is the runtime environment for the code and the model.
- Create an HTTPS endpoint configuration. This configuration carries information about compute instance type and count, the models, and the traffic split across model variants.
- Create ML instances and an HTTPS endpoint. SageMaker creates a fleet of ML instances and an HTTPS endpoint that handles traffic and authentication. This final step puts everything together into a working HTTPS endpoint that can serve client requests.
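The three steps above map onto three SageMaker API calls: `CreateModel`, `CreateEndpointConfig`, and `CreateEndpoint`. The sketch below builds the request payload for each call as a plain dictionary, so the mapping can be seen without touching AWS; the names (bucket, role ARN, image URI, instance type) are placeholder assumptions, not real resources.

```python
def build_model_request(model_name, image_uri, model_data_url, role_arn):
    # Step 1: the model ties the container image to the model.tar.gz
    # training artifact in S3, plus an execution role.
    return {
        "ModelName": model_name,
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
        "ExecutionRoleArn": role_arn,
    }

def build_endpoint_config_request(config_name, model_name):
    # Step 2: the endpoint configuration sets instance type and count,
    # and the traffic weight for each production variant.
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",        # single-variant example
            "ModelName": model_name,
            "InstanceType": "ml.m5.xlarge",     # placeholder instance type
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1.0,        # 100% of traffic
        }],
    }

def build_endpoint_request(endpoint_name, config_name):
    # Step 3: creating the endpoint provisions the ML instances and the
    # HTTPS endpoint that serves client requests.
    return {"EndpointName": endpoint_name, "EndpointConfigName": config_name}
```

With `boto3`, each payload would be passed as keyword arguments to the corresponding `sagemaker` client method, for example `client.create_model(**build_model_request(...))`, followed by `create_endpoint_config` and `create_endpoint`.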
Hosting a real-time endpoint poses one particular challenge that is common to hosting any website or web application...