Analyzing Vulnerability Assessment Reports using LangChain
As powerful as ChatGPT and the OpenAI API are, they currently have a significant limitation: the token window. This window determines how much text, measured in tokens (word fragments, not characters), can be exchanged in a complete message between the user and ChatGPT. Once the token count exceeds this limit, ChatGPT may lose track of the original context, making the analysis of large bodies of text or documents challenging.
Enter LangChain—a framework designed to navigate around this very hurdle. LangChain allows us to embed and vectorize large groups of text.
Important note
Embedding refers to the process of transforming text into numerical vectors that an ML model can understand and process. Vectorizing, on the other hand, is a technique to encode non-numeric features as numbers. By converting large bodies of text into vectors, we can enable ChatGPT to access and analyze vast amounts of information, effectively turning the text into a knowledge base that the model can refer to, even if it hasn't been trained on this data previously.
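To build intuition for what "text becomes a vector" means, here is a deliberately tiny sketch using a bag-of-words count vector. Real embedding models, such as OpenAI's, produce dense learned vectors with hundreds or thousands of dimensions; the vocabulary and sentences below are made up purely for illustration:

```python
# Toy illustration of "text -> vector" (bag-of-words counts).
# Real embedding models learn dense semantic vectors; this only
# shows the shape of the idea.

def embed(text, vocabulary):
    """Turn a piece of text into a vector of word counts."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vocab = ["vulnerability", "critical", "patch", "server"]

v1 = embed("Critical vulnerability found on the server", vocab)
v2 = embed("Apply the patch to the server", vocab)

print(v1)  # [1, 1, 0, 1]
print(v2)  # [0, 0, 1, 1]
```

Once text is represented this way, "similar meaning" can be approximated as "nearby vectors", which is what makes similarity search over a document possible.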
In this recipe, we will leverage the power of LangChain, Python, the OpenAI API, and Streamlit (a framework for quickly and easily creating web applications) to analyze voluminous documents such as vulnerability assessment reports, threat reports, standards, and more. With a simple UI for uploading files and crafting prompts, the task of analyzing these documents will be simplified to the point of asking ChatGPT straightforward natural language queries.
Getting ready
Before we start with the recipe, ensure that you have an OpenAI account set up and have obtained your API key. If you haven’t done this yet, please revisit Chapter 1 for the steps. Apart from this, you’ll also need the following:
- Python libraries: Ensure that you have the necessary Python libraries installed in your environment. You'll specifically need libraries such as `python-docx`, `langchain`, `streamlit`, and `openai`. This recipe also reads PDFs with `PyPDF2` and stores vectors with FAISS, so install `PyPDF2` and `faiss-cpu` as well. You can install these using the `pip install` command as follows:

```bash
pip install python-docx langchain streamlit openai PyPDF2 faiss-cpu
```
- Vulnerability assessment report (or a large document of your choice to be analyzed): Prepare a vulnerability assessment report or any other substantial document that you aim to analyze. The document can be in any format as long as you can convert it into a PDF.
- Access to LangChain documentation: Throughout this recipe, we will be utilizing LangChain, a relatively new framework. Although we will walk you through the process, having the LangChain documentation handy might be beneficial. You can access it at https://docs.langchain.com/docs/.
- Streamlit: We will be using Streamlit, a fast and straightforward way to create web apps for Python scripts. While we will guide you through the basics in this recipe, you may want to explore it on your own. You can learn more about Streamlit at https://streamlit.io/.
How to do it…
In this recipe, we’ll walk you through the steps needed to create a document analyzer using LangChain, Streamlit, OpenAI, and Python. The application will allow you to upload a PDF document, ask questions about it in natural language, and get responses generated by the language model based on the document’s content:
- Set up the environment and import required modules: Start by importing all the required modules. You'll need `streamlit` to create the web interface, `PyPDF2` to read the PDF files, and various components from `langchain` to handle the language model and text processing. The script also expects your OpenAI API key to be available in the `OPENAI_API_KEY` environment variable (if you keep the key in a `.env` file, you can load it with `python-dotenv` before running):

```python
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback
```
- Initialize the Streamlit application: Set up the Streamlit page and header. This will create a web application with the title `"Document Analyzer"` and a `"What would you like to know about this document?"` header text prompt:

```python
def main():
    st.set_page_config(page_title="Document Analyzer")
    st.header("What would you like to know about this document?")
```
- Upload the PDF: Add a file uploader to the Streamlit application to allow users to upload a PDF document:

```python
pdf = st.file_uploader("Upload your PDF", type="pdf")
```
- Extract the text from the PDF: If a PDF is uploaded, read the PDF and extract the text from it:

```python
if pdf is not None:
    pdf_reader = PdfReader(pdf)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()
```
- Split the text into chunks: Break down the extracted text into manageable chunks that can be processed by the language model:

```python
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_text(text)
if not chunks:
    st.write("No text chunks were extracted from the PDF.")
    return
```
- Create embeddings: Use `OpenAIEmbeddings` to create vector representations of the chunks and store them in a FAISS index:

```python
embeddings = OpenAIEmbeddings()
if not embeddings:
    st.write("No embeddings found.")
    return
knowledge_base = FAISS.from_texts(chunks, embeddings)
```
- Ask a question about the PDF: Show a text input field in the Streamlit application for the user to ask a question about the uploaded PDF:

```python
user_question = st.text_input("Ask a question about your PDF:")
```
- Generate a response: If the user asks a question, find the chunks that are semantically similar to the question, feed those chunks to the language model, and generate a response:

```python
if user_question:
    docs = knowledge_base.similarity_search(user_question)
    llm = OpenAI()
    chain = load_qa_chain(llm, chain_type="stuff")
    with get_openai_callback() as cb:
        response = chain.run(input_documents=docs, question=user_question)
        print(cb)
    st.write(response)
```
- Run the script with Streamlit: Using a command-line terminal, run the following command from the same directory as the script:

```bash
streamlit run app.py
```
- Open a web browser and go to the `localhost` URL that Streamlit prints in the terminal (by default, http://localhost:8501).
Here is how the completed script should look:
```python
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback


def main():
    st.set_page_config(page_title="Document Analyzer")
    st.header("What would you like to know about this document?")

    # upload file
    pdf = st.file_uploader("Upload your PDF", type="pdf")

    # extract the text
    if pdf is not None:
        pdf_reader = PdfReader(pdf)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()

        # split into chunks
        text_splitter = CharacterTextSplitter(
            separator="\n",
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )
        chunks = text_splitter.split_text(text)
        if not chunks:
            st.write("No text chunks were extracted from the PDF.")
            return

        # create embeddings
        embeddings = OpenAIEmbeddings()
        if not embeddings:
            st.write("No embeddings found.")
            return
        knowledge_base = FAISS.from_texts(chunks, embeddings)

        # show user input
        user_question = st.text_input("Ask a question about your PDF:")
        if user_question:
            docs = knowledge_base.similarity_search(user_question)
            llm = OpenAI()
            chain = load_qa_chain(llm, chain_type="stuff")
            with get_openai_callback() as cb:
                response = chain.run(input_documents=docs, question=user_question)
                print(cb)
            st.write(response)


if __name__ == '__main__':
    main()
```
The script essentially automates the analysis of large documents, such as vulnerability assessment reports, using the LangChain framework, Python, and OpenAI. It leverages Streamlit to create an intuitive web interface where users can upload a PDF file for analysis.
The uploaded document undergoes a series of operations: it’s read and its text is extracted, then split into manageable chunks. These chunks are transformed into vector representations (embeddings) using OpenAI Embeddings, enabling the language model to interpret and process the text semantically. These embeddings are stored in a database (Facebook AI Similarity Search, or FAISS for short), facilitating efficient similarity searches.
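What the FAISS index does at query time can be pictured in miniature: given stored vectors, find the one closest to a query vector. The toy cosine-similarity search below is only illustrative (the three-dimensional vectors and chunk names are invented); FAISS does the same job with optimized index structures over thousands of high-dimensional embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Pretend these are embeddings of three document chunks.
store = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.1, 0.9, 0.1],
    "chunk-c": [0.0, 0.2, 0.9],
}

query = [0.8, 0.2, 0.1]  # pretend embedding of the user's question
best = max(store, key=lambda k: cosine(query, store[k]))
print(best)  # chunk-a
```

Because the question and the chunks live in the same vector space, "which chunk is most relevant?" reduces to "which stored vector is nearest the query vector?".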
The script then provides an interface for users to ask questions about the uploaded document. Upon receiving a question, it identifies the most semantically relevant chunks of text to the question from the database. These chunks, along with the user’s question, are processed by a question-answering chain in LangChain, generating a response that is displayed back to the user.
In essence, this script transforms large, unstructured documents into an interactive knowledge base, enabling users to pose questions and receive AI-generated responses based on the document’s content.
How it works…
- First, the necessary modules are imported. These include `streamlit` for creating the application's UI, `PyPDF2` for handling PDF documents, and various modules from `langchain` for handling language model tasks.
- The Streamlit application's page configuration is set and a file uploader is created that accepts PDF files. Once a PDF file is uploaded, the application uses `PyPDF2` to read the text of the PDF.
- The text from the PDF is then split into smaller chunks using LangChain's `CharacterTextSplitter`. This ensures that the text can be processed within the language model's maximum token limit. The chunking parameters used to split the text (`chunk_size`, `chunk_overlap`, and `separator`) are specified.
, used to split the text—are specified. - Next, OpenAI Embeddings from LangChain are used to convert the chunks of text into vector representations. This involves encoding the semantic information of the text into a mathematical form that can be processed by the language model. These embeddings are stored in a FAISS database, which allows efficient similarity searching for high-dimensional vectors.
- The application then takes a user input in the form of a question about the PDF. It uses the FAISS database to find the chunks of text that are semantically most similar to the question. These chunks are likely to contain the information needed to answer the question.
- The chosen chunks of text and the user’s question are fed into a question-answering chain from LangChain. This chain is loaded with an instance of the OpenAI language model. The chain processes the input documents and the question, using the language model to generate a response.
- The OpenAI callback is used to capture metadata about the API usage, such as the number of tokens used in the request.
- Finally, the response from the chain is displayed in the Streamlit application.
This process allows for semantic querying of large documents that exceed the language model’s token limit. By splitting the document into smaller chunks and using semantic similarity to find the chunks most relevant to a user’s question, the application can provide useful answers even when the entire document can’t be processed at once by the language model. This demonstrates one way to overcome the token limit challenge when working with large documents and language models.
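The split-index-retrieve pattern described above can be sketched without any external libraries. Here, word-overlap scoring stands in for real embeddings, and word-based chunking stands in for `CharacterTextSplitter`'s character counting; the document text is invented for the example:

```python
def split_words(text, chunk_size=8, overlap=2):
    """Split text into overlapping word chunks, echoing the splitter's
    chunk_size/chunk_overlap idea (the real splitter counts characters)."""
    words = text.split()
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += step
    return chunks

def most_relevant(chunks, question):
    """Stand-in for semantic search: score chunks by shared words."""
    q_words = set(question.lower().split())
    return max(chunks, key=lambda c: len(q_words & set(c.lower().split())))

doc = ("The web server is missing several security patches. "
       "The database accepts weak passwords from remote hosts. "
       "Centralized logging has not been configured.")

chunks = split_words(doc)
best = most_relevant(chunks, "weak passwords")
print(best)
```

Only `best` (not the full document) would be handed to the language model, which is exactly how the recipe stays under the token limit: the model sees a handful of relevant chunks rather than the entire report.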
There’s more…
LangChain is not just a tool for overcoming the token window limitation; it’s a comprehensive framework for creating applications that interact intelligently with language models. These applications can connect a language model to other data sources and allow the model to interact with its environment—essentially providing the model with a degree of agency. LangChain offers modular abstractions for the components necessary to work with language models, along with a collection of implementations for these abstractions. Designed for ease of use, these components can be employed whether you’re using the full LangChain framework or not.
What’s more, LangChain introduces the concept of chains—these are combinations of the aforementioned components, assembled in specific ways to accomplish particular use cases. Chains offer a high-level interface for users to get started with a specific use case easily and are designed to be customizable to cater to a variety of tasks.
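The shape of a chain is easy to picture outside LangChain: a sequence of components where each one's output feeds the next. The plain-Python sketch below is not LangChain's actual API; the retriever, prompt template, and "LLM" are hypothetical stand-ins used only to show the composition idea:

```python
def make_chain(*steps):
    """Compose steps into one callable: each step's output feeds the next."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

# Hypothetical components standing in for a retriever, a prompt
# template, and a language model.
retrieve = lambda q: {"question": q, "context": "Port 443 is open."}
to_prompt = lambda d: f"Context: {d['context']}\nQuestion: {d['question']}"
fake_llm = lambda prompt: "Answer based on: " + prompt.splitlines()[0]

chain = make_chain(retrieve, to_prompt, fake_llm)
print(chain("Which ports are open?"))
# Answer based on: Context: Port 443 is open.
```

The `load_qa_chain` call used in this recipe is one pre-assembled chain of this kind: it stuffs the retrieved documents into a prompt and passes that prompt to the model.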
In later recipes, we'll demonstrate how to use these features of LangChain to analyze even larger and more complex documents, such as `.csv` files and spreadsheets.