Summary
This chapter covered the two primary patterns for designing systems around LLMs: batch and real-time. The choice depends on your organization's use case requirements. We learned that batch mode sends queries in bulk, achieving higher throughput at the expense of higher latency. It is better suited to long-running workloads and the consumption of a large corpus of data.
Results are not immediately exposed to users, allowing for additional review pipelines before or after model inference.
We also learned that real-time mode offers back-and-forth querying at a faster rate, providing quicker feedback to and from the end user. It has lower throughput but suits low-latency requirements; the opportunities to review results are reduced to avoid adding latency.
In this chapter, we addressed the implications of batch versus real-time processing for each component of the integration pipeline. For entry points, real-time optimizes for streamlined user prompting, while batch handles data pipeline inputs.
In pre-processing, real-time employs lighter techniques to minimize latency, whereas batch allows for heavier enrichment. Inference in real-time focuses on low latency per request, while batch processes requests in groups for improved throughput.
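The throughput-versus-latency trade-off at the inference stage can be sketched in a few lines. This is a minimal illustration, not code from the chapter: `fake_generate` is a hypothetical stand-in for a real model endpoint, and the batch size is an arbitrary example value.

```python
from typing import Callable


def fake_generate(prompts: list[str]) -> list[str]:
    """Hypothetical stand-in for a model call; a real system would hit an LLM endpoint."""
    return [f"answer:{p}" for p in prompts]


def realtime_infer(prompt: str, generate: Callable[[list[str]], list[str]]) -> str:
    # Real-time mode: one request per call, minimizing per-request latency.
    return generate([prompt])[0]


def batch_infer(
    prompts: list[str],
    generate: Callable[[list[str]], list[str]],
    batch_size: int = 8,
) -> list[str]:
    # Batch mode: group requests so each model invocation amortizes its
    # overhead across many prompts, trading latency for throughput.
    results: list[str] = []
    for i in range(0, len(prompts), batch_size):
        results.extend(generate(prompts[i : i + batch_size]))
    return results
```

In practice, the `generate` callable would wrap a provider SDK or a self-hosted model server; the structural difference between the two modes is only whether prompts are grouped before invocation.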
Post-processing in real-time involves quicker formatting and filtering, but batch processing allows for more complex transformations. In terms of presentation, real-time offers instantaneous UI updates, while batch exports results asynchronously.
Additionally, the chapter provided an example use case of using GenAI to enhance a website search, with document ingestion occurring in batch mode and search/response generation in real-time mode, transforming the end-user experience with more relevant and personalized answers.
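The split in that use case can be sketched as two functions: a batch-mode `ingest` step that pre-computes an index offline, and a real-time `search` step that queries it per user request. This is a simplified illustration with a toy token-overlap score standing in for the embedding-based retrieval a real system would use; all names here are hypothetical.

```python
def ingest(documents: dict[str, str]) -> dict[str, set[str]]:
    """Batch step: pre-compute a token set per document.

    A production pipeline would compute embeddings here instead;
    either way, the heavy work happens offline, in bulk.
    """
    return {doc_id: set(text.lower().split()) for doc_id, text in documents.items()}


def search(query: str, index: dict[str, set[str]]) -> str:
    """Real-time step: score documents by token overlap, return the best match.

    Only this lightweight lookup runs on the user's request path.
    """
    query_tokens = set(query.lower().split())
    return max(index, key=lambda doc_id: len(query_tokens & index[doc_id]))
```

The design point is the same regardless of the scoring method: ingestion tolerates latency and benefits from batching, while the query path stays minimal so responses return quickly.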
In the next chapter, we will dive deep into a use case that leverages GenAI to extract data from 10-K documents.
Join our community on Discord
Join our community’s Discord space for discussions with the authors and other readers:
https://packt.link/genpat