Build your First RAG with Qdrant

  • 10 min read
  • 12 Oct 2023



Introduction

Large Language Models (LLMs) have emerged as powerful tools for various tasks, including question-answering. However, as many are now aware, LLMs alone are often not suitable for question-answering, primarily because of their limited access to up-to-date information, which frequently results in incorrect or hallucinated responses. To overcome this limitation, one approach is to provide these LLMs with verified facts and data. In this article, we'll explore a solution to this challenge and delve into the scalability aspect of improving question-answering with Qdrant, a vector similarity search engine and vector database.

To address the limitations of LLMs, one approach is to provide known facts alongside queries. By doing so, LLMs can draw on actual, verifiable information and generate more accurate responses. One of the latest breakthroughs in this field is Retrieval-Augmented Generation (RAG), a tripartite approach that seamlessly combines Retrieval, Augmentation, and Generation to enhance the quality and relevance of responses generated by AI systems.

At the core of the RAG model lies the retrieval step. This initial phase involves the model searching external sources to gather relevant information. These sources can span a wide spectrum, encompassing databases, knowledge bases, sets of documents, or even search engine results. The primary objective here is to find valuable snippets or passages of text that contain information related to the given input or prompt.

The retrieval process is a vital foundation upon which RAG's capabilities are built. It allows the model to extend its knowledge beyond what is hardcoded or pre-trained, tapping into a vast reservoir of real-time or context-specific information. By accessing external sources, the model ensures that it remains up-to-date and informed, a critical aspect in a world where information changes rapidly.

Once the retrieval step is complete, the RAG model takes a critical leap forward by moving to the augmentation phase. During this step, the retrieved information is seamlessly integrated with the original input or prompt. This fusion of external knowledge with the initial context enriches the pool of information available to the model for generating responses.

Augmentation plays a pivotal role in enhancing the quality and depth of the generated responses. By incorporating external knowledge, the model becomes capable of providing more informed and accurate answers. This augmentation also aids in making the model's responses more contextually appropriate and relevant, as it now possesses a broader understanding of the topic at hand.

The final step in the RAG model's process is the generation phase. Armed with both the retrieved external information and the original input, the model sets out to craft a response that is not only accurate but also contextually rich. This last step ensures that the model can produce responses that are deeply rooted in the information it has acquired.

By drawing on this additional context, the model can generate responses that are more contextually appropriate and relevant. This is a significant departure from traditional AI models that rely solely on pre-trained data and fixed knowledge. The generation phase of RAG represents a crucial advance in AI capabilities, resulting in more informed and human-like responses.

To summarize, RAG can be applied to question-answering through a multi-step pipeline that starts with a set of documents. These documents are converted into embeddings, essentially numerical representations, which are then searched by similarity when a query is presented. The top N most similar document embeddings are retrieved, and the corresponding documents are selected. These documents, along with the query, are then passed to the LLM, which generates a comprehensive answer.
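To make this pipeline concrete, here is a minimal, framework-agnostic sketch of those steps. The `embed_fn` and `llm_fn` callables are placeholders for whichever embedding model and LLM you choose, and a brute-force cosine similarity stands in for the dedicated vector database we introduce below:

import numpy as np

def answer_with_rag(query, documents, embed_fn, llm_fn, top_n=3):
    # Retrieval: embed every document and the query, then rank by cosine similarity.
    doc_vecs = np.array([embed_fn(d) for d in documents])
    q_vec = np.asarray(embed_fn(query))
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-12
    )
    top_docs = [documents[i] for i in np.argsort(sims)[::-1][:top_n]]

    # Augmentation: combine the retrieved passages with the original query.
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n\n".join(top_docs) + "\n\n"
        "Question: " + query
    )

    # Generation: the LLM produces the final, context-grounded answer.
    return llm_fn(prompt)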

This approach improves the quality of question-answering, but it depends on two crucial variables: the quality of the embeddings and the quality of the LLM itself. In this article, our focus will be on the former: making the embedding search process scalable with Qdrant.

Qdrant, pronounced "quadrant," is a vector similarity search engine and vector database designed to address these challenges. It provides a production-ready service with a user-friendly API for storing, searching, and managing vectors. However, what sets Qdrant apart is its enhanced filtering support, making it a versatile tool for neural-network or semantic-based matching, faceted search, and various other applications. It is built using Rust, a programming language known for its speed and reliability even under high loads, making it an ideal choice for demanding applications. The benchmarks speak for themselves, showcasing Qdrant's impressive performance.

[Image: Qdrant benchmark results]

In the quest for improving the accuracy and scalability of question-answering systems, Qdrant stands out as a valuable ally. Its capabilities in vector similarity search, coupled with the power of Rust, make it a formidable tool for any application that demands efficient and accurate search operations. Without wasting any more time, let’s take a deep breath, make yourselves comfortable, and be ready to learn how to build your first RAG with Qdrant!

Setting Up Qdrant

To get started with Qdrant, you have several installation options, each tailored to different preferences and use cases. In this guide, we'll explore the various installation methods, including Docker, building from source, the Python client, and deploying on Kubernetes.

Docker Installation

Docker is known for its simplicity and ease of use when it comes to deploying software, and Qdrant is no exception. Here's how you can get Qdrant up and running using Docker:

1. First, ensure that the Docker daemon is installed and running on your system. You can verify this with the following command:

sudo docker info

If the Docker daemon is not listed, start it before proceeding. On Linux, running Docker commands typically requires sudo privileges. To run Docker commands without sudo, you can create a docker group and add your user to it.


2. Pull the Qdrant Docker image from DockerHub:

docker pull qdrant/qdrant

3. Run the container, exposing port 6333 and specifying a directory for data storage:

docker run -p 6333:6333 -v $(pwd)/path/to/data:/qdrant/storage qdrant/qdrant
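Once the container is running, a quick way to verify the installation (assuming the default port mapping above) is to query the REST endpoint; it should respond with a small JSON payload containing the service name and version:

curl http://localhost:6333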

Building from Source

Building Qdrant from source is an option if you have specific requirements or prefer not to use Docker. Here's how to build Qdrant using Cargo, the Rust package manager:

  1. Before compiling, make sure you have the necessary libraries and the Rust toolchain installed. The current list of required libraries can be found in the Dockerfile.
  2. Build Qdrant with Cargo:
cargo build --release --bin qdrant

After a successful build, you can find the binary at ./target/release/qdrant.

Python Client

In addition to the Qdrant service itself, there is a Python client that provides additional features compared to clients generated directly from OpenAPI. To install the Python client, you can use pip:

pip install qdrant-client

This client allows you to interact with Qdrant from your Python applications, enabling seamless integration and control.
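As a quick sanity check, you can connect to a running instance and list its collections. The snippet below assumes the local Docker setup from earlier, listening on the default port 6333:

from qdrant_client import QdrantClient

# Connect to the local Qdrant instance started earlier.
client = QdrantClient(host="localhost", port=6333)

# List the collections currently stored in this instance (empty on a fresh install).
print(client.get_collections())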

Kubernetes Deployment

If you prefer to run Qdrant in a Kubernetes cluster, you can utilize a ready-made Helm Chart. Here's how you can deploy Qdrant using Helm:

helm repo add qdrant https://qdrant.to/helm
helm install qdrant-release qdrant/qdrant
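After Helm finishes, you can check that the Qdrant pods have come up; the exact pod names will depend on the release name used above:

kubectl get pods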

Building RAG with Qdrant and LangChain

Qdrant works seamlessly with LangChain; in fact, you can use Qdrant directly in LangChain through the `VectorDBQA` class! The first thing we need to do is to gather all the documents that we want to use as the source of truth for our LLM. Let's say we store them in a list variable named `docs`. This `docs` variable is a list of strings, where each element is a chunk of one or more paragraphs.
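For illustration, `docs` might look something like the following; the passages here are just placeholders, and in practice they would be chunks taken from your own knowledge base:

docs = [
    "Qdrant is a vector similarity search engine and vector database written in Rust.",
    "It provides a production-ready service with an API for storing, searching, and managing vectors.",
    "Qdrant supports filtering on payloads attached to the stored vectors.",
]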

The next thing that we need to do is to generate the embeddings from the docs. For the sake of an example, we’ll use a small model provided by the `sentence-transformers` package.

from langchain.vectorstores import Qdrant
from langchain.embeddings import HuggingFaceEmbeddings

# A small sentence-transformers model used to embed each document chunk.
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Embed the docs and index them in Qdrant (QDRANT_HOST and QDRANT_API_KEY point to your instance).
qdrant_vec_store = Qdrant.from_texts(
    docs,
    embedding_model,
    host=QDRANT_HOST,
    api_key=QDRANT_API_KEY,
)

Once we have set up the embedding model and Qdrant, we can move on to the next parts of RAG: augmentation and generation. To do that, we'll utilize the `VectorDBQA` class. This class loads the most relevant docs from Qdrant and passes them to the LLM. Once the docs are passed in, i.e. augmented, the LLM analyzes them to generate the answer to the given query. In this example, we'll use GPT-3.5-turbo, provided by OpenAI.

from langchain import OpenAI, VectorDBQA

llm = OpenAI(openai_api_key=OPENAI_API_KEY)

# "stuff" inserts the retrieved documents directly into the prompt before generation.
rag = VectorDBQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    vectorstore=qdrant_vec_store,
    return_source_documents=False,
)

The final thing to do is to test the pipeline by passing a query to the `rag` object; LangChain, backed by Qdrant, will handle the rest!

rag.run(question)
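Here, `question` is whatever the user asks. Putting it together, a call might look like this (the query itself is just an illustrative placeholder):

question = "when was the first commercial flight"  # illustrative query
answer = rag.run(question)
print(answer)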

Below are some examples of the answers generated by the LLM based on the provided documents, using the Natural Questions dataset.

[Image: example answers generated by the RAG pipeline on the Natural Questions dataset]

Conclusion

Congratulations on making it to this point! Throughout this article, you have learned what RAG is, how it can improve the quality of your question-answering model, how to scale the embedding search part of the pipeline with Qdrant, and how to build your first RAG with Qdrant and LangChain. Best of luck with your experiments in creating your first RAG, and see you in the next article!

Author Bio

Louis Owen is a data scientist/AI engineer from Indonesia who is always hungry for new knowledge. Throughout his career journey, he has worked in various fields of industry, including NGOs, e-commerce, conversational AI, OTA, Smart City, and FinTech. Outside of work, he loves to spend his time helping data science enthusiasts to become data scientists, either through his articles or through mentoring sessions. He also loves to spend his spare time doing his hobbies: watching movies and conducting side projects.

Currently, Louis is an NLP Research Engineer at Yellow.ai, the world’s leading CX automation platform. Check out Louis’ website to learn more about him! Lastly, if you have any queries or any topics to be discussed, please reach out to Louis via LinkedIn.