Understanding the LLM Twin’s RAG inference pipeline
Before implementing the RAG inference pipeline, let's discuss its software architecture and the advanced RAG techniques it uses. Figure 9.1 illustrates an overview of the RAG inference flow. The inference pipeline starts from the user's input query, retrieves the relevant context through the retrieval module (based on that query), and calls the LLM SageMaker service to generate the final answer.
Figure 9.1: RAG inference pipeline architecture
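To make the flow concrete, here is a minimal sketch of the three inference steps described above: retrieve, augment, and generate. The interfaces and names used here (ContextRetriever, SagemakerLLM, rag_inference, the prompt template) are illustrative assumptions, not the exact classes from the LLM Twin codebase.

```python
from __future__ import annotations

from typing import Protocol


class ContextRetriever(Protocol):
    """Retrieval module: searches the vector DB for chunks relevant to the query."""

    def search(self, query: str, k: int) -> list[str]: ...


class SagemakerLLM(Protocol):
    """Client wrapping the LLM microservice deployed on AWS SageMaker."""

    def generate(self, prompt: str) -> str: ...


PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {query}
Answer:"""


def rag_inference(query: str, retriever: ContextRetriever, llm: SagemakerLLM, k: int = 3) -> str:
    # 1. Retrieve the context relevant to the user query from the vector DB.
    chunks = retriever.search(query, k=k)
    context = "\n\n".join(chunks)

    # 2. Augment the prompt with the retrieved context.
    prompt = PROMPT_TEMPLATE.format(context=context, query=query)

    # 3. Call the LLM service to generate the final answer.
    return llm.generate(prompt)
```

Any concrete retriever and LLM client that satisfy these two interfaces can be plugged in, which is what keeps the retrieval module and the LLM service independently swappable.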
The feature pipeline and the retrieval module, shown in Figure 9.1, are independent processes. The feature pipeline runs on a different machine, on a schedule, to populate the vector DB. Meanwhile, the retrieval module is called on demand, within the inference pipeline, on every user request.
This separation of concerns ensures that the vector DB is always populated with the latest data, keeping the features fresh, while the retrieval module can access those features on demand for every user request.
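As a rough sketch of this separation, the snippet below simulates the two processes sharing a single vector DB: one job that writes fresh features on a schedule, and one function that only reads at request time. All names here (VectorDB, feature_pipeline_job, handle_user_request) are hypothetical stand-ins; in the real system the feature pipeline runs on a separate machine on its own schedule and the vector DB is a managed service (e.g., Qdrant).

```python
from __future__ import annotations


class VectorDB:
    """Stand-in for the shared vector database."""

    def __init__(self) -> None:
        self._documents: list[str] = []

    def upsert(self, documents: list[str]) -> None:
        self._documents.extend(documents)

    def search(self, query: str, k: int) -> list[str]:
        # A real implementation would run an embedding similarity search.
        return self._documents[-k:]


def feature_pipeline_job(vector_db: VectorDB) -> None:
    """Scheduled process: ingest, chunk, embed, and load the latest data."""
    fresh_documents = ["...newly crawled and embedded chunks..."]
    vector_db.upsert(fresh_documents)


def handle_user_request(query: str, vector_db: VectorDB) -> list[str]:
    """On-demand retrieval: called inside the inference pipeline per request."""
    return vector_db.search(query, k=3)


if __name__ == "__main__":
    db = VectorDB()
    # In production, the feature pipeline runs independently on a schedule
    # (e.g., hourly); here we simulate a single run before serving a request.
    feature_pipeline_job(db)
    print(handle_user_request("What is an LLM Twin?", db))
```

Because the writer and reader only share the vector DB, either side can be scaled, scheduled, or redeployed without touching the other.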