Understanding the LLM Twin’s RAG inference pipeline
Before implementing the RAG inference pipeline, we want to discuss its software architecture and the advanced RAG techniques it uses. Figure 9.1 illustrates an overview of the RAG inference flow. The inference pipeline starts with the input query, retrieves the context relevant to that query using the retrieval module, and calls the LLM service hosted on SageMaker to generate the final answer.
Figure 9.1: RAG inference pipeline architecture
The feature pipeline and the retrieval module, shown in Figure 9.1, are independent processes. The feature pipeline runs on a different machine on a schedule to populate the vector DB, while the retrieval module is called on demand, within the inference pipeline, on every user request.
This separation of concerns keeps the vector DB populated with the latest data, ensuring feature freshness, while the retrieval module can access the latest features on every request. The RAG retrieval module takes the user’s query as input and returns the most relevant and similar data points from the vector DB, which are then used to guide the LLM in generating the final answer.
To fully understand the dynamics of the RAG inference pipeline, let’s go through the architecture flow from Figure 9.1 step by step:
- User query: We begin with the user who makes a query, such as “Write an article about...”
- Query expansion: We expand the initial query to generate multiple queries that reflect different aspects or interpretations of the original user query. Thus, instead of one query, we will use xN queries (see the query expansion sketch after this list). By diversifying the search terms, the retrieval module increases the likelihood of capturing a comprehensive set of relevant data points. This step is crucial when the original query is too narrow or vague.
- Self-querying: We extract useful metadata from the original query, such as the author’s name. The extracted metadata will be used as filters for the vector search operation, removing irrelevant data points from the search space and making the search more accurate and faster (see the self-query sketch after this list).
- Filtered vector search: We embed each query and perform a similarity search to find the top K data points for each query. We execute xN searches, one per expanded query. We call this step a filtered vector search as we leverage the metadata extracted in the self-query step as search filters (see the combined search and reranking sketch after this list).
- Collecting results: For each search operation, we get up to xK results that are closest to its specific expanded-query interpretation. Then, we aggregate the results of all the xN searches, ending up with a list of N x K results containing a mix of article, post, and repository chunks. The results cover a broader set of potentially relevant chunks, offering multiple relevant angles based on the original query’s different facets.
- Reranking: To keep only the top K most relevant results from the list of N x K potential items, we must filter the list further. We use a reranking algorithm that scores each chunk based on its relevance and importance relative to the initial user query. We leverage a neural cross-encoder model to compute the score, a value between 0 and 1, where 1 means the result is entirely relevant to the query. Ultimately, we sort the N x K results by score and pick the top K items. Thus, the output is a ranked list of K chunks, with the most relevant data points at the top.
- Build the prompt and call the LLM: We map the final list of the K most relevant chunks to a string used to build the final prompt. We create the prompt from a prompt template, the retrieved context, and the user’s query (see the prompt-building sketch after this list). Ultimately, the augmented prompt is sent to the LLM, which is hosted on AWS SageMaker and exposed as an API endpoint.
- Answer: We wait for the answer to be generated. After the LLM processes the prompt, the RAG logic finishes by sending the generated response to the user.
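To make these steps more concrete, here is a minimal sketch of the query expansion step. It is not the exact implementation we will build: the `call_llm()` helper and the prompt wording are hypothetical placeholders for whichever LLM client you use.

```python
from typing import Callable

# Hypothetical expansion prompt; the exact wording is an illustrative assumption.
EXPANSION_PROMPT = """You are an AI assistant. Generate {n} different versions of
the user question to retrieve relevant documents from a vector database.
Provide the alternative questions separated by newlines.
Original question: {question}"""


def expand_query(query: str, n: int, call_llm: Callable[[str], str]) -> list[str]:
    """Generate n alternative queries that capture different facets of the original one."""
    response = call_llm(EXPANSION_PROMPT.format(n=n, question=query))
    expanded = [line.strip() for line in response.split("\n") if line.strip()]

    # Always keep the original query in the final list of xN queries.
    return [query] + expanded[:n]
```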
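Self-querying can follow the same pattern, with a prompt that extracts structured metadata instead of alternative queries. The prompt wording, the `author_full_name` metadata key, and the `call_llm()` helper are again assumptions made for illustration.

```python
from typing import Callable

# Hypothetical self-query prompt; only the author's name is extracted in this sketch.
SELF_QUERY_PROMPT = """You are an AI assistant. Extract the author's name from the
user question. Return only the name, or the token "none" if no author is mentioned.
Question: {question}"""


def extract_metadata(query: str, call_llm: Callable[[str], str]) -> dict:
    """Extract metadata from the query to be used as filters in the vector search."""
    author = call_llm(SELF_QUERY_PROMPT.format(question=query)).strip()

    # The metadata key is an illustrative assumption about the vector DB payload schema.
    return {} if author.lower() == "none" else {"author_full_name": author}
```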
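The filtered vector search, result collection, and reranking steps could look roughly like the following sketch, assuming a Qdrant vector DB and sentence-transformers models for embedding and cross-encoding. The collection name, payload field, and model identifiers are illustrative choices, not the project’s exact configuration.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
client = QdrantClient(url="http://localhost:6333")


def filtered_vector_search(queries: list[str], metadata: dict, k: int) -> list[str]:
    """Run one filtered similarity search per expanded query and aggregate the chunks."""
    query_filter = Filter(
        must=[
            FieldCondition(key=key, match=MatchValue(value=value))
            for key, value in metadata.items()
        ]
    ) if metadata else None

    chunks: list[str] = []
    for query in queries:  # xN searches, one per expanded query
        hits = client.search(
            collection_name="articles",  # illustrative collection name
            query_vector=embedder.encode(query).tolist(),
            query_filter=query_filter,
            limit=k,  # top K per search -> N x K results overall
        )
        chunks.extend(hit.payload["content"] for hit in hits)  # assumed payload field
    return chunks


def rerank(query: str, chunks: list[str], k: int) -> list[str]:
    """Score each chunk against the original query with a cross-encoder and keep the top K."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

Note that this particular cross-encoder returns raw relevance logits; applying a sigmoid would map them to the 0 to 1 range described above without changing the ranking.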
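Finally, here is a sketch of building the augmented prompt and calling the LLM through the SageMaker runtime client from `boto3`. The prompt template, the endpoint name, and the request and response schema (shown in the Hugging Face text-generation style) are assumptions that depend on how the model was deployed.

```python
import json

import boto3

# Hypothetical prompt template combining the instructions, the context, and the user's query.
PROMPT_TEMPLATE = """You are a content creator. Answer the user's query using the
provided context as your primary source of information.

Context: {context}

User query: {query}"""


def build_prompt(query: str, chunks: list[str]) -> str:
    """Join the top K reranked chunks into one context string and fill in the template."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, query=query)


def call_llm_endpoint(prompt: str, endpoint_name: str = "llm-twin-endpoint") -> str:
    """Invoke the LLM hosted on AWS SageMaker; the payload format depends on the deployed container."""
    client = boto3.client("sagemaker-runtime")
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 512, "temperature": 0.7}}

    response = client.invoke_endpoint(
        EndpointName=endpoint_name,  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read().decode("utf-8"))[0]["generated_text"]
```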
That wraps up the overview of the RAG inference pipeline. Now, let’s dig deeper into the details.