Creating a chatbot using an LLM
In this recipe, we will create a chatbot using the LangChain framework. In the previous recipe, we learned how to ask an LLM questions based on a piece of content. Though the LLM was able to answer the questions accurately, the interaction with it was completely stateless: the LLM looked at each question in isolation and ignored any previous interactions or questions it was asked. In this recipe, we will use an LLM to create a chat interaction in which the LLM is aware of the previous conversation and uses the context from it to answer subsequent questions. An application of such a framework is conversing with document sources and getting to the right answer by asking a series of questions. These document sources can be of a wide variety of types, from internal company knowledge bases to customer contact center troubleshooting guides. Our goal here is to present a basic step-by-step framework that demonstrates the essential components working together to achieve the end goal.
Getting ready
We will use a model from OpenAI in this recipe. Please refer to Model access under the Technical requirements section to complete the steps for accessing the OpenAI model. You can use the 10.5_chatbot_with_llm.ipynb notebook from the code site if you want to work from an existing notebook.
How to do it…
The recipe does the following things:
- It initializes the ChatGPT LLM and an embedding provider. The embedding provider is used to vectorize the document content so that a vector-based similarity search can be performed.
- It scrapes content from a webpage and breaks it into chunks.
- The text in the document chunks is vectorized and stored in a vector store.
- A conversation is started with the LLM via some curated prompts, and a follow-up question is asked based on the answer the LLM provided in the previous turn.
Let’s get started:
- Do the necessary imports:
import getpass
import os

import bs4
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import FAISS
from langchain_core.messages import AIMessage, HumanMessage, BaseMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import (
    ChatPromptTemplate,
    MessagesPlaceholder,
)
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
- In this step, we initialize the gpt-4o-mini model from OpenAI using the ChatOpenAI initializer:

os.environ["OPENAI_API_KEY"] = getpass.getpass()
llm = ChatOpenAI(model="gpt-4o-mini")
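If you want to confirm that the API key works and the model is reachable before building the rest of the pipeline, a quick smoke test along these lines can help (the prompt text is arbitrary):

# Optional smoke test: send a trivial prompt and print the model's reply.
print(llm.invoke("Say hello in five words or fewer.").content)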
- In this step, we load the embedding provider. The content from the webpage is vectorized via the embedding provider. We use the pre-trained sentence-transformers/all-mpnet-base-v2 model via the HuggingFaceEmbeddings constructor. This model works well for encoding short sentences or paragraphs, and the encoded vector representation captures the semantic context well. Please refer to the model card at https://huggingface.co/sentence-transformers/all-mpnet-base-v2 for more details:

embeddings_provider = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2")
- In this step, we will load a webpage whose content we want to ask questions about. We initialize a WebBaseLoader object, pass it the URL, and call the load method on the loader instance. Feel free to change the link to any other webpage that you might want to use as the chat knowledge base:

loader = WebBaseLoader(
    ["https://lilianweng.github.io/posts/2023-06-23-agent/"])
docs = loader.load()
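If you want to verify that the page was fetched correctly before chunking it, a quick inspection of the loaded documents can be helpful:

# Optional check: confirm the page loaded and peek at its metadata and text.
print(len(docs), "document(s) loaded")
print(docs[0].metadata)            # source URL, title, and other metadata
print(docs[0].page_content[:200])  # the first few hundred characters of text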
- Initialize a text splitter instance of the RecursiveCharacterTextSplitter type and use it to split the documents into chunks:

text_splitter = RecursiveCharacterTextSplitter()
document_chunks = text_splitter.split_documents(docs)
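The default splitter settings are sufficient for this recipe, but if you want finer control over chunking, RecursiveCharacterTextSplitter accepts chunk_size and chunk_overlap arguments. The values below are illustrative, not prescribed by the recipe:

custom_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # maximum number of characters per chunk
    chunk_overlap=200  # characters shared between neighboring chunks
)
custom_chunks = custom_splitter.split_documents(docs)
print(len(custom_chunks), "chunks")
print(custom_chunks[0].page_content[:200])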
- We initialize the vector (embedding) store from the document chunks that we created in the previous step, passing it the document chunks and the embedding provider. We also initialize the vector store retriever and the output parser. The retriever will provide the augmented content to the chain via the vector store. We provided more details in the Augmenting the LLM with external content recipe in this chapter; to avoid repetition, we recommend referring to that recipe:

vectorstore = FAISS.from_documents(
    document_chunks, embeddings_provider)
retriever = vectorstore.as_retriever(search_type="similarity")
output_parser = StrOutputParser()
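Before wiring the retriever into a chain, you can optionally query it directly to confirm that the vector store returns relevant chunks. The query string below is an arbitrary example related to the scraped page:

# Optional check: run a similarity search directly against the retriever.
sample_docs = retriever.invoke("What is task decomposition?")
for doc in sample_docs[:2]:
    print(doc.page_content[:150], "...")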
- In this step, we initialize a contextualized system prompt. A system prompt defines the persona and the instructions to be followed by the LLM. In this case, the system prompt instructs the LLM to use the chat history to formulate a standalone question. We initialize the prompt instance with the system prompt definition and set it up with the expectation that a chat_history variable will be passed to it at runtime, along with the question, which will also be passed at runtime:

contextualize_q_system_prompt = """Given a chat history and the latest user question \
which might reference context in the chat history, formulate a standalone question \
which can be understood without the chat history. Do NOT answer the question, \
just reformulate it if needed and otherwise return it as is."""
contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{question}"),
    ]
)
- In this step, we initialize the contextualized chain. As you can see in the previous code snippet, the prompt is set up with the chat history. This chain takes the chat history and a given follow-up question from the user and reformulates it as a standalone question; the populated prompt template is then sent to the LLM. The idea here is that a subsequent question will not carry any context of its own and must be understood in terms of the chat history generated so far:

contextualize_q_chain = contextualize_q_prompt | llm | output_parser
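To see what this chain does on its own, you can invoke it with a toy chat history; the history messages below are made up purely for illustration:

standalone_question = contextualize_q_chain.invoke({
    "chat_history": [
        HumanMessage(content="What is an LLM agent?"),
        AIMessage(content="An LLM agent uses a language model as its "
                          "reasoning engine to plan and execute tasks."),
    ],
    "question": "What are its main components?",
})
# The chain should return something like
# "What are the main components of an LLM agent?"
print(standalone_question)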
- In this step, we initialize a RAG-style system prompt, much like in the previous recipe. This prompt just sets up a prompt template. However, as the chat history grows, we pass this prompt a contextualized question; in other words, this prompt always answers a contextualized question, barring the first one:

qa_system_prompt = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. \
Use three sentences maximum and keep the answer concise.

{context}"""
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{question}"),
    ]
)
- We initialize two helper methods. The contextualized_question method returns the contextualized chain if a chat history exists; otherwise, it returns the input question, which is the typical scenario for the first question. The format_docs method concatenates the page content of each document, separated by two newline characters:

def contextualized_question(input: dict):
    if input.get("chat_history"):
        return contextualize_q_chain
    else:
        return input["question"]

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
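As a small illustration of the routing logic, calling the helper with an empty history returns the raw question, whereas a non-empty history makes it return the contextualizing chain itself:

print(contextualized_question(
    {"question": "What is an agent?", "chat_history": []}))
# -> What is an agent?  (with a non-empty history, the chain object is returned)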
- In this step, we set up a chain. We use the RunnablePassthrough class to set up the context. The RunnablePassthrough class allows us to pass the input through or add additional data to it via dictionary values. The assign method takes a key and assigns a value to that key; in this case, the key is context and the assigned value is the result of chaining the contextualized question, the retriever, and format_docs. Putting that into the context of the entire recipe: for the first question, the context is the set of records matched for the question itself; for the second question, the contextualized question derived from the chat history is used to retrieve a set of matching records, which are then passed as the context. The LangChain framework uses a deferred execution model here. We set up the chain with the necessary constructs, such as context, qa_prompt, and the LLM; this just sets the expectation that each component will pipe its output to the next component when the chain is invoked. Any placeholder arguments that were set as part of the prompts will be populated and used during invocation (a small illustration of RunnablePassthrough.assign follows the code below):

rag_chain = (
    RunnablePassthrough.assign(
        context=contextualized_question | retriever | format_docs)
    | qa_prompt
    | llm
)
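If RunnablePassthrough.assign seems opaque, the following toy example (unrelated to the recipe's data) shows how it forwards the input dictionary unchanged while adding a new key computed from it:

# assign() keeps the original keys and adds "shouted", computed by the lambda.
toy_chain = RunnablePassthrough.assign(
    shouted=lambda x: x["question"].upper())
print(toy_chain.invoke({"question": "what is an llm?"}))
# -> {'question': 'what is an llm?', 'shouted': 'WHAT IS AN LLM?'}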
- In this step, we initialize a chat history array and ask a simple question by invoking the chain. Internally, this is treated as just the first question, since there is no chat history present at this point. The rag_chain simply answers the question and we print the answer. We also extend the chat_history with the question and the returned message:

chat_history = []
question = "What is a large language model?"
ai_msg = rag_chain.invoke(
    {"question": question, "chat_history": chat_history})
print(ai_msg.content)
chat_history.extend([HumanMessage(content=question),
                     AIMessage(content=ai_msg.content)])
This results in the following output:
A large language model (LLM) is an artificial intelligence system designed to understand and generate human-like text based on the input it receives. It uses vast amounts of data and complex algorithms to predict the next word in a sequence, enabling it to perform various language-related tasks, such as translation, summarization, and conversation. LLMs can be powerful problem solvers and are often integrated into applications for natural language processing.
- In this step, we invoke the chain again with a subsequent question, without providing many contextual cues. We provide the chain with the chat history and print the answer to the second question. Internally, the rag_chain and the contextualize_q_chain work in tandem to answer this question. The contextualize_q_chain uses the chat history to rewrite the follow-up question as a standalone question; the rag_chain then retrieves the records matching that rewritten question and uses them, together with the contextualized question, to answer it. As we can observe from the output, the LLM was able to decipher what it refers to in this context:

second_question = "Can you explain the reasoning behind calling it large?"
second_answer = rag_chain.invoke(
    {"question": second_question, "chat_history": chat_history})
print(second_answer.content)
This results in the following output:
The term "large" in large language model refers to both the size of the model itself and the volume of data it is trained on. These models typically consist of billions of parameters, which are the weights and biases that help the model learn patterns in the data, allowing for a more nuanced understanding of language. Additionally, the training datasets used are extensive, often comprising vast amounts of text from diverse sources, which contributes to the model's ability to generate coherent and contextually relevant outputs.
Note:
We provided a basic workflow for how to execute RAG-based flows. We recommend referring to the LangChain documentation and using the necessary components to run solutions in production. Some of these steps would include evaluating other vector DB stores and using concrete types, such as BaseChatMessageHistory and RunnableWithMessageHistory, to better manage chat histories, as well as using LangServe to expose endpoints to serve requests.
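As a rough, non-production sketch of that suggestion, chat histories can be managed per session with RunnableWithMessageHistory; the in-memory store and the session_id value below are illustrative assumptions, so consult the LangChain documentation for the current API:

from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

session_store = {}

def get_session_history(session_id: str):
    # One in-memory history per session id; swap in a persistent
    # BaseChatMessageHistory implementation for production use.
    if session_id not in session_store:
        session_store[session_id] = ChatMessageHistory()
    return session_store[session_id]

chat_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="question",
    history_messages_key="chat_history",
)

response = chat_chain.invoke(
    {"question": "What is a large language model?"},
    config={"configurable": {"session_id": "demo-session"}},
)
print(response.content)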