What is the best way to host my model?
As you probably expected, the answer to this question depends entirely on the application you’re building. To begin, most customers start with one big question: do you need responses from your model in real time, that is, synchronously? This is the case for search, recommendations, chat, and similar applications. Most real-time model deployments use a hosted endpoint: an instance that stays on in the cloud to serve requests as they arrive.

This is usually contrasted with its opposite: batch. Batch jobs take your model and inference data, spin up compute clusters to run the inference script over all of the requested data, and then spin back down. The key difference between real-time deployments and batch jobs is how long you can wait between new data arriving and the model’s response. With real-time deployments, you get the fastest possible model responses and pay a premium for it. With batch jobs, you won’...
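The trade-off above can be captured in a tiny decision helper. This is an illustrative heuristic only, not tied to any particular cloud SDK; the function name and thresholds are assumptions made for the sketch:

```python
# Illustrative heuristic (hypothetical helper, not a real SDK call): pick a
# hosting mode from the two questions the text raises -- how quickly callers
# need an answer, and whether requests arrive continuously or in bulk.

def choose_hosting(max_wait_seconds: float, requests_arrive_continuously: bool) -> str:
    """Return a suggested hosting mode for a model.

    max_wait_seconds: how long a caller can tolerate waiting for a prediction.
    requests_arrive_continuously: True for search/chat-style traffic,
        False for periodic dumps of data (e.g., nightly scoring).
    """
    # Interactive, sub-second latency effectively requires an always-on endpoint.
    if max_wait_seconds < 1 or requests_arrive_continuously:
        return "real-time endpoint"
    # If callers can wait and data arrives in bulk, batch is usually cheaper:
    # compute spins up only for the duration of the job, then shuts down.
    return "batch job"

print(choose_hosting(0.2, True))    # chat traffic: always-on endpoint
print(choose_hosting(3600, False))  # nightly scoring: batch job
```

In practice the decision also factors in traffic volume and budget, but latency tolerance and arrival pattern are usually the first filter.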