Scaling inference endpoints to meet traffic demands
When we operate a real-time inference endpoint, its processing power requirements vary with incoming traffic. For example, if we provide air quality inferences for a mobile application, usage will likely fluctuate with the time of day. If we provision the endpoint for peak load, we pay for idle capacity during off-peak times; if we provision it for a smaller load, we may hit performance bottlenecks during peaks. We can use inference endpoint auto-scaling to match capacity to demand.
There are two types of scaling: vertical and horizontal. Vertical scaling adjusts the size of an individual endpoint instance, while horizontal scaling adjusts the number of endpoint instances. We prefer horizontal scaling because it is less disruptive: a load balancer can redistribute traffic across instances without end users noticing, whereas resizing an instance typically requires replacing it.
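As a concrete illustration of horizontal auto-scaling, here is a minimal sketch assuming the endpoint is hosted on Amazon SageMaker and configured through the Application Auto Scaling API via boto3; the endpoint name (air-quality-endpoint), variant name (AllTraffic), instance bounds, and target of 1,000 invocations per minute per instance are all illustrative assumptions, not values from this text.

import boto3

# Hypothetical names: a SageMaker endpoint "air-quality-endpoint" with a
# production variant "AllTraffic" is assumed to exist already.
autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/air-quality-endpoint/variant/AllTraffic"

# Register the endpoint variant as a scalable target, bounding how far
# the instance count may scale in or out.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,   # floor for off-peak traffic
    MaxCapacity=4,   # ceiling for peak traffic
)

# Attach a target-tracking policy: add or remove instances so each
# instance handles roughly 1,000 invocations per minute on average.
autoscaling.put_scaling_policy(
    PolicyName="air-quality-invocations-policy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,  # wait 5 minutes before removing capacity
        "ScaleOutCooldown": 60,  # wait 1 minute before adding capacity
    },
)

With a target-tracking policy like this, the platform adds instances when the per-instance invocation rate rises above the target and removes them when it falls, so capacity follows the traffic curve described above rather than being fixed at peak or off-peak levels.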
There are four steps to configure...