Imagine seamlessly processing vast amounts of data, posing any question, and receiving eloquently crafted answers in return. While Large Language Models like ChatGPT excel with general, public data, they falter when it comes to your private information: data you'd rather not broadcast to the world. Enter LangChain: it lets us pair virtually any language model with our own exclusive data.
In this article, we'll explore LangChain, a framework designed for building applications with language models. We'll guide you through connecting a model, specifically OpenAI's ChatGPT, to private data of your choosing. While we provide a structured tutorial, feel free to adapt the steps to your own dataset and model preferences; variations are expected and encouraged. Along the way, we'll also point out alternative features you can incorporate.
Undoubtedly, the confidentiality of personalized data is of absolute importance. While companies amass vast amounts of data daily, which offers them invaluable insights, it's crucial to safeguard this information. Disclosing such proprietary information to external entities could jeopardize the company's competitive edge and overall business integrity.
Before selecting a dataset, it's essential to ask a few preliminary questions: What specific topics do you want to ask about? Is the dataset large and detailed enough to answer them?
Depending on your dataset's file format, you'll need different loading methods to import the data into LangChain.
To ensure efficient data processing, it's crucial to divide the dataset into smaller segments, often referred to as chunks, so that each piece fits comfortably within the model's context window.
Embedding is a technique where words or phrases from the vocabulary are mapped to vectors of real numbers. The idea behind embeddings is to capture the semantic meaning and relationships of words in a lower-dimensional space than the original representation.
Finally, once our documentation has been indexed and connected to the model, we can query it directly for any information we require.
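To make the embedding step concrete, here is a minimal sketch, assuming the langchain and openai packages are installed and an OpenAI API key is configured (the example sentence is purely illustrative):

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()  # reads the OPENAI_API_KEY environment variable
vector = embeddings.embed_query("Giskard is an open-source testing framework for ML models.")
print(len(vector))  # a plain list of floats, e.g. 1536 dimensions for OpenAI's embedding model

Texts with similar meanings end up with vectors that sit close together, which is exactly what makes the similarity search used later in this tutorial possible.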
LangChain's versatility stems from its ability to process varied datasets. For our demonstration, we utilize the "Giskard Documentation", a comprehensive guide on the Giskard framework.
Giskard is an open-source testing framework for Machine Learning models, spanning various Python model types. It automatically detects vulnerabilities in ML models, generates domain-specific tests, and integrates open-source QA best practices.
Having said that, LangChain can seamlessly integrate with a myriad of other data sources, be they textual, tabular, or even multimedia, expanding its use-case horizons.
As with the first step of building any machine learning project, we will have to set up our environment, making sure the required packages are installed and our OpenAI API key is available.
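Here is a minimal sketch of the setup this tutorial assumes (the package list and the placeholder key are illustrative). First, install the required packages from a terminal:

pip install langchain openai faiss-cpu pypdf tiktoken

Then, in Python, make your OpenAI API key available to LangChain:

import os
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own key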
LangChain offers the capability to load data in various formats. In this article, we'll focus on loading data in PDF format but will also touch upon other popular formats such as CSV and File Directory. For details on other file formats, please refer to the LangChain Documentation.
We've compiled the Giskard AI tool's documentation into a PDF and subsequently partitioned the data.
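A sketch of that step might look like the following (the PDF file name and chunking parameters are assumptions, not values from the original setup):

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the compiled Giskard documentation (one document per PDF page).
loader = PyPDFLoader("giskard_documentation.pdf")
pages = loader.load()

# Partition the pages into overlapping chunks that fit comfortably in the model's context window.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
documents = text_splitter.split_documents(pages)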
Below are the code snippets if you prefer to work with either CSV or File Directory file formats.
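For example (the file paths and glob pattern are placeholders; note that DirectoryLoader relies on the unstructured package by default):

from langchain.document_loaders import CSVLoader, DirectoryLoader

# CSV: each row of the file becomes its own document.
csv_loader = CSVLoader(file_path="giskard_docs.csv")
csv_documents = csv_loader.load()

# File Directory: load every matching file from a folder.
dir_loader = DirectoryLoader("./giskard_docs/", glob="**/*.txt")
dir_documents = dir_loader.load()

Whichever loader you choose, the result is the same kind of document list, so the chunking and indexing steps that follow stay unchanged.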
We will be creating an index using FAISS (Facebook AI Similarity Search), a library developed by Facebook AI for efficient similarity search over large collections of vectors, such as the embeddings produced by machine learning models.
We will be converting those documents into vector embeddings using OpenAIEmbeddings(). This indexed data can then be used for efficient similarity searches later on.
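A sketch of the indexing step, continuing from the chunked documents produced above (the index directory name is illustrative):

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Embed every chunk and build a FAISS index over the resulting vectors.
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(documents, embeddings)

# Optionally persist the index so it can be reloaded later without re-embedding.
db.save_local("giskard_faiss_index")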
There are multiple methods by which we can retrieve information from the indexed data.
In the context of large language models (LLMs) and natural language processing, similarity search is often about finding sentences, paragraphs, or documents that are semantically similar to a given sentence or piece of text.
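A similarity search against the FAISS index built above can be run like this (the query string is an assumption chosen to match the output shown below):

# Retrieve the chunks whose embeddings are closest to the query's embedding.
query = "Why Giskard?"
results = db.similarity_search(query, k=4)
print(results[0].page_content)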
Similarity Search Output:

Why Giskard? Giskard is an open-source testing framework dedicated to ML models, covering any Python model, from tabular to LLMs. Testing Machine Learning applications can be tedious. Since ML models depend on data, testing scenarios depend on the domain specificities and are often infinite. Where to start testing? Which tests to implement? What issues to cover? How to implement the tests? At Giskard, we believe that Machine Learning needs its own testing framework. Created by ML engineers for ML engineers, Giskard enables you to:

Scan your model to find dozens of hidden vulnerabilities: The Giskard scan automatically detects vulnerability issues such as performance bias, data leakage, unrobustness, spurious correlation, overconfidence, underconfidence, unethical issue, etc.

Instantaneously generate domain-specific tests: Giskard automatically generates relevant tests based on the vulnerabilities detected by the scan. You can easily customize the tests depending on your use case by defining domain-specific data slicers and transformers as fixtures of your test suites.

Leverage the Quality Assurance best practices of the open-source community: The Giskard catalog enables you to easily contribute and load data slicing & transformation functions such as AI-based detectors (toxicity, hate, etc.), generators (typos, paraphraser, etc.), or evaluators. Inspired by the Hugging Face philosophy, the aim of Giskard is to become.
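An answer like the one below can be produced by passing the retrieved chunks to a chat model through a question-answering chain. Here is a minimal sketch, assuming a "refine" chain and gpt-3.5-turbo (the query, model, and chain type are illustrative, not necessarily the exact configuration behind this output):

from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# A "refine" chain shows the model the retrieved chunks one at a time,
# letting it improve its answer with each additional piece of context.
chain = load_qa_chain(llm, chain_type="refine")

query = "What is Giskard and which ML libraries does it support?"
docs = db.similarity_search(query)
answer = chain.run(input_documents=docs, question=query)
print(answer)

# List where the answer came from (PDF source and page number for each chunk).
for doc in docs:
    print(doc.metadata.get("source"), doc.metadata.get("page"))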
LLM Chain Output:
Based on the additional context provided, Giskard is a Python package or library that provides tools for wrapping machine learning models, testing, debugging, and inspection. It supports models from various machine learning libraries such as HuggingFace, PyTorch, TensorFlow, or Scikit-learn. Giskard can handle classification, regression, and text generation tasks using tabular or text data.
One notable feature of Giskard is the ability to upload models to the Giskard server. Uploading models to the server allows users to compare their models with others using a test suite, gather feedback from colleagues, debug models effectively in case of test failures, and develop new tests incorporating additional domain knowledge. This feature enables collaborative model evaluation and improvement.
It is worth highlighting that the provided context mentions additional ML libraries, including Langchain, API REST, and LightGBM, but their specific integration with Giskard is not clearly defined.
Sources:
LangChain effectively bridges the gap between advanced language models and the need for data privacy. Throughout this article, we have highlighted its capability to ground models in private data, delivering both insightful results and data security. One thing is certain: as AI continues to grow, tools like LangChain will be essential for balancing innovation with user trust.
Mostafa Ibrahim is a dedicated software engineer based in London, where he works in the dynamic field of Fintech. His professional journey is driven by a passion for cutting-edge technologies, particularly in the realms of machine learning and bioinformatics. When he's not immersed in coding or data analysis, Mostafa loves to travel.