Fundamental concepts for serving models
In this section, we will introduce some important concepts related to hosting our models so that our clients can interact with them.
Online and offline model serving
In ML, there are generally two options for serving predictions from models: online serving (also known as real-time serving) and offline serving (also known as batch serving). These methods support the high-level use cases of online (or real-time) inference and offline (or batch) inference, respectively. Let’s take a few minutes to introduce these methods and understand their use cases.
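To make the contrast concrete, here is a minimal Python sketch showing the shape of each interaction. The DummyModel class, its predict method, and the example feature values are hypothetical stand-ins for a real trained model and real data, not any particular library's API:

```python
# Minimal sketch contrasting online and offline serving patterns.
# The model here is a hypothetical placeholder; any trained model
# exposing a predict() method would fit the same two patterns.

from typing import List


class DummyModel:
    """Stand-in for a trained model loaded from disk or a model registry."""

    def predict(self, features: List[float]) -> float:
        # Trivial placeholder logic; a real model would run inference here.
        return sum(features)


model = DummyModel()


def handle_online_request(features: List[float]) -> float:
    """Online/real-time serving: one request, one immediate response.

    A client (a customer-facing app or another system) calls this
    synchronously and waits for the prediction before continuing.
    """
    return model.predict(features)


def run_offline_batch(batch: List[List[float]]) -> List[float]:
    """Offline/batch serving: score many records in one scheduled run.

    Results are typically written to storage for later consumption
    rather than returned to a client that is actively waiting.
    """
    return [model.predict(features) for features in batch]


if __name__ == "__main__":
    # Online: a single prediction returned immediately to the caller.
    print(handle_online_request([0.2, 0.5, 1.1]))

    # Offline: a whole dataset scored in one pass, e.g. on a nightly schedule.
    print(run_offline_batch([[0.2, 0.5], [1.0, 2.0], [3.0, 4.0]]))
```

The key difference is who waits: the online path blocks a caller until a single prediction returns, whereas the batch path processes many records at once and stores the results for later use.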
Online/real-time model serving
As the name suggests, in real-time model serving the model needs to respond “in real time” to prediction requests, which usually means that a client (perhaps a customer or some other system) needs to receive an inference response as quickly as possible, and may be waiting synchronously for...