Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
LLM Engineer's Handbook

You're reading from   LLM Engineer's Handbook Master the art of engineering large language models from concept to production

Arrow left icon
Product type Paperback
Published in Oct 2024
Publisher Packt
ISBN-13 9781836200079
Length 522 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Authors (3):
Arrow left icon
Maxime Labonne Maxime Labonne
Author Profile Icon Maxime Labonne
Maxime Labonne
Paul Iusztin Paul Iusztin
Author Profile Icon Paul Iusztin
Paul Iusztin
Alex Vesa Alex Vesa
Author Profile Icon Alex Vesa
Alex Vesa
Arrow right icon
View More author details
Toc

Table of Contents (15) Chapters Close

Preface 1. Understanding the LLM Twin Concept and Architecture 2. Tooling and Installation FREE CHAPTER 3. Data Engineering 4. RAG Feature Pipeline 5. Supervised Fine-Tuning 6. Fine-Tuning with Preference Alignment 7. Evaluating LLMs 8. Inference Optimization 9. RAG Inference Pipeline 10. Inference Pipeline Deployment 11. MLOps and LLMOps 12. Other Books You May Enjoy
13. Index
Appendix: MLOps Principles

Understanding the LLM Twin’s RAG inference pipeline

Before implementing the RAG inference pipeline, we want to discuss its software architecture and advanced RAG techniques. Figure 9.1 illustrates an overview of the RAG inference flow. The inference pipeline starts with the input query, retrieves the context using the retrieval module (based on the query), and calls the LLM SageMaker service to generate the final answer.

Figure 9.1: RAG inference pipeline architecture

The feature pipeline and the retrieval module, defined in Figure 9.1, are independent processes. The feature pipeline runs on a different machine on a schedule to populate the vector DB. At the same time, the retrieval module is called on demand, within the inference pipeline, on every user request.

By separating concerns between the two components, the vector DB is always populated with the latest data, ensuring feature freshness, while the retrieval module can access the latest features on every request. The input of the RAG retrieval module is the user’s query, based on which we have to return the most relevant and similar data points from the vector DB, which will be used to guide the LLM in generating the final answer.

To fully understand the dynamics of the RAG inference pipeline, let’s go through the architecture flow from Figure 9.1 step by step:

  1. User query: We begin with the user who makes a query, such as “Write an article about...”
  2. Query expansion: We expand the initial query to generate multiple queries that reflect different aspects or interpretations of the original user query. Thus, instead of one query, we will use xN queries. By diversifying the search terms, the retrieval module increases the likelihood of capturing a comprehensive set of relevant data points. This step is crucial when the original query is too narrow or vague.
  3. Self-querying: We extract useful metadata from the original query, such as the author’s name. The extracted metadata will be used as filters for the vector search operation, eliminating redundant data points from the query vector space (making the search more accurate and faster).
  4. Filtered vector search: We embed each query and perform a similarity search to find each search’s top K data points. We execute xN searches corresponding to the number of expanded queries. We call this step a filtered vector search as we leverage the metadata extracted from the self-query step as query filters.
  5. Collecting results: We get up to xK results closest to its specific expanded query interpretation for each search operation. Further, we aggregate the results of all the xN searches, ending up with a list of N x K results containing a mix of articles, posts, and repositories chunks. The results include a broader set of potentially relevant chunks, offering multiple relevant angles based on the original query’s different facets.
  6. Reranking: To keep only the top K most relevant results from the list of N x K potential items, we must filter the list further. We will use a reranking algorithm that scores each chunk based on the relevance and importance relative to the initial user query. We will leverage a neural cross-encoder model to compute the score, a value between 0 and 1, where 1 means the result is entirely relevant to the query. Ultimately, we sort the N x K results based on the score and pick the top K items. Thus, the output is a ranked list of K chunks, with the most relevant data points situated at the top.
  7. Build the prompt and call the LLM: We map the final list of the most relevant K chunks to a string used to build the final prompt. We create the prompt using a prompt template, the retrieved context, and the user’s query. Ultimately, the augmented prompt is sent to the LLM (hosted on AWS SageMaker exposed as an API endpoint).
  8. Answer: We are waiting for the answer to be generated. After the LLM processes the prompt, the RAG logic finishes by sending the generated response to the user.

That wraps up the overview of the RAG inference pipeline. Now, let’s dig deeper into the details.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image