Inference, serving, and scalability
In the realm of LLMs, the topics of inference, serving, and scalability are crucial for efficient operation and optimal user experience. These aspects cover how the model computes its outputs from incoming requests (inference), how those outputs are delivered to end users (serving), and how the system adapts to varying loads (scalability).
Online and batch inference
Inference can broadly be categorized into online and batch processing. Online inference refers to the real-time processing of individual queries, where a response is generated as soon as the request arrives. Batch inference, on the other hand, processes large volumes of queries together, which is more efficient for tasks that don't require immediate responses.
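The contrast can be made concrete with a minimal sketch. The example below assumes the Hugging Face transformers library, with gpt2 standing in as a placeholder model and the prompts purely illustrative: the same pipeline answers a single query immediately (online) or processes a list of queries together with a batch size (batch).

```python
# Minimal sketch of online vs. batch inference, assuming the Hugging Face
# transformers library; gpt2 and the prompts are illustrative placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Online inference: one request arrives, one response is returned right away.
def answer_query(query: str) -> str:
    result = generator(query, max_new_tokens=32)
    return result[0]["generated_text"]

print(answer_query("What is your return policy?"))

# Batch inference: many queries are collected and processed together,
# trading per-request latency for higher overall throughput.
queries = [
    "Summarize this product review: ...",
    "Classify the sentiment of this message: ...",
    "Draft a reply to this customer email: ...",
]
results = generator(queries, max_new_tokens=32, batch_size=8)
for prompt, output in zip(queries, results):
    print(prompt, "->", output[0]["generated_text"])
```

In practice, the online path sits behind a request handler that must respond within a latency budget, while the batch path typically runs as a scheduled job over accumulated inputs.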
Consider, for instance, a conversational AI chatbot used by a large retail company, where online inference plays a crucial role. The chatbot is tasked with interacting with customers in real time, answering their queries, resolving issues, and providing product...