
Harnessing Weaviate and integrating with LangChain

  • 20 min read
  • 31 Aug 2023


Introduction

In the first part of this series, we built a robust RSS news retrieval system using Weaviate, enabling us to fetch and store news articles efficiently. Now, in this second part, we're taking the next leap by exploring how to harness the power of Weaviate for similarity search and integrating it with LangChain. We will delve into the creation of a Streamlit application that performs real-time similarity search, contextual understanding, and dynamic context building. With the increasing demand for relevant and contextual information, this section will unveil the magic of seamlessly integrating various technologies to create an enhanced user experience.

Before we dive into the exciting world of similarity search and context building, let's ensure you're equipped with the necessary tools. Familiarity with Weaviate, Streamlit, and Python will be essential as we explore these advanced concepts and create a dynamic application.

Similarity Search and Weaviate Integration

The journey of enhancing news context retrieval doesn't end with fetching articles. Often, users seek not just relevant information, but also contextually similar content. This is where similarity search comes into play.

Similarity search enables us to find articles that share semantic similarities with a given query. In the context of news retrieval, it's like finding articles that discuss similar events or topics. This functionality empowers users to discover a broader range of perspectives and relevant articles.

Weaviate's core strength lies in its ability to perform fast and accurate similarity search. We utilize the perform_similarity_search function to query Weaviate for articles related to a given concept. This function returns a list of articles, each scored based on its relevance to the query.

import weaviate
from langchain.llms import OpenAI
import datetime
import pytz
from dateutil.parser import parse

davinci = OpenAI(model_name='text-davinci-003')

def perform_similarity_search(concept):
    """
    Perform a similarity search on the given concept.
    Args:
    - concept (str): The term to search for, e.g., "Bitcoin" or "Ethereum"
   
    Returns:
    - dict: A dictionary containing the result of the similarity search
    """
    client = weaviate.Client("<http://weaviate:8080>")
   
    nearText = {"concepts": [concept]}

    response = (
        client.query
        .get("RSS_Entry", ["title", "link", "summary", "publishedDate", "body"])
        .with_near_text(nearText)
        .with_limit(50)  # fetching a maximum of 50 similar entries
        .with_additional(['certainty'])
        .do()
    )
   
    return response

def sort_and_filter(results):
    # Sort results by certainty
    sorted_results = sorted(results, key=lambda x: x['_additional']['certainty'], reverse=True)

    # Sort the top results by date
    top_sorted_results = sorted(sorted_results[:50], key=lambda x: parse(x['publishedDate']), reverse=True)

    # Return the 5 most recent of the top results
    return top_sorted_results[:5]

# Define the prompt template
template = """
You are a financial analyst reporting on the latest developments and providing
an overview of the topics you are asked about.
Using only the provided context, answer the following question.
Prioritize relevance and clarity in your response. If no relevant information regarding the
query is found in the context, clearly indicate this in the response and ask the user to
rephrase the query so that the search topic is clearer. If information
is found, summarize the key developments and cite the sources inline using numbers (e.g., [1]).
All sources should consistently be cited with their "Source Name", "link to the article",
and "Date and Time". List the full sources at the end in the same numerical order.

Today is: {today_date}

Context:
{context}

Question:
{query}

Answer:

Example Answer (for no relevant information):
"No relevant information regarding 'topic X' was found in the provided context."

Example Answer (for relevant information):
"The latest update on 'topic X' reveals that A and B have occurred. This was reported by 'Source Name' on 'Date and Time' [1]. Another significant development is D, as highlighted by 'Another Source Name' on 'Date and Time' [2]."

Sources (if relevant):
[1] Source Name, "link to the article provided in the context", Date and Time
[2] Another Source Name, "link to the article provided in the context", Date and Time
"""

# query_db runs the similarity search and builds a context-rich prompt for the LLM
def query_db(query):

    # Query the weaviate database
    results = perform_similarity_search(query)
    results = results['data']['Get']['RSS_Entry']
    top_results = sort_and_filter(results)

    # Convert your context data into a readable string
    context_string = [f"title:{r['title']}\\nsummary:{r['summary']}\\nbody:{r['body']}\\nlink:{r['link']}\\npublishedDate:{r['publishedDate']}\\n\\n" for r in top_results]
    context_string = '\\n'.join(context_string)

    # Get today's date
    date_format = "%a, %d %b %Y %H:%M:%S %Z"
    today_date = datetime.datetime.now(pytz.utc).strftime(date_format)
    # Format the prompt
    prompt = template.format(
        query=query,
        context=context_string,
        today_date=today_date
    )

    # Print the formatted prompt for verification
    print(prompt)

    # Run the prompt through the model directly
    response = davinci(prompt)

    # Extract and print the response
    return response

Retrieved results need effective organization for user consumption. The sort_and_filter function handles this task. It first sorts the results based on their certainty scores, ensuring the most relevant articles are prioritized. Then, it further sorts the top results by their published dates, providing users with the latest information to build the context for the LLM.

LangChain Integration for Context Building

While similarity search enhances content discovery, context is the key to understanding the significance of articles. Integrating LangChain with Weaviate allows us to dynamically build context and provide more informative responses.

LangChain, a framework for building applications around large language models, acts as our context builder. It enhances the user experience by constructing context around the retrieved articles, enabling users to understand the broader narrative. Our modified query_db function now incorporates LangChain's capabilities. The function generates a context-rich prompt that combines the user's query with the top retrieved articles. This prompt is structured using a template that ensures clarity and relevance.

The prompt template is a structured piece of text that guides LangChain to generate contextually meaningful responses. It dynamically includes information about the query, context, and relevant articles. This ensures that users receive comprehensive and informative answers.

Handling Irrelevant Queries

One of LangChain's unique strengths is its ability to gracefully handle queries with limited context. When no relevant information is found in the context, LangChain generates a response that informs the user about the absence of relevant data. This ensures transparency and guides users to refine their queries for better results.
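
To see how these pieces fit together end to end, here is a minimal usage sketch. It assumes the Weaviate instance from the first part is running at http://weaviate:8080, the RSS_Entry class is already populated, and an OpenAI API key is available in the environment; the query string is purely illustrative.

# Illustrative query; any topic covered by the ingested feeds works
answer = query_db("Bitcoin ETF approval")

# The response either summarizes the latest developments with numbered,
# inline citations, or states that no relevant context was found
print(answer)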

In the next section, we will be integrating this enhanced news retrieval system with a Streamlit application, providing users with an intuitive interface to access relevant and contextual information effortlessly.

Building the Streamlit Application

In the previous section, we explored the intricate layers of building a robust news context retrieval system using Weaviate and LangChain. Now we're diving into the realm of user experience enhancement by creating a Streamlit application. Streamlit empowers us to transform our backend functionalities into a user-friendly front-end interface with minimal effort. Let's discover how we can harness the power of Streamlit to provide users with a seamless and intuitive way to access relevant news articles and context.


Streamlit is a Python library that enables developers to create interactive web applications with minimal code. Its simplicity, coupled with its ability to provide real-time visualizations, makes it a fantastic choice for creating data-driven applications.

The structure of a Streamlit app is straightforward yet powerful. Streamlit apps are composed of simple Python scripts that leverage the provided Streamlit API functions. This section will provide an overview of how the Streamlit app is structured and how its components interact.
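
As a point of reference, the skeleton of a Streamlit script is only a handful of lines. The snippet below is a minimal, self-contained sketch rather than the application built in this series, and its widget labels are placeholders.

import streamlit as st

# Streamlit reruns the whole script from top to bottom on every interaction,
# so the layout reads like a plain linear program
st.title("News Context Retrieval")
query = st.text_input("Enter a topic, e.g. 'Ethereum upgrades'")

if st.button("Search") and query:
    st.write(f"Searching for: {query}")

Alongside the front end, the ingestion service shown next keeps the RSS_Entry class populated with fresh articles so that searches always run against recent news.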

import feedparser
import pandas as pd
import time
from bs4 import BeautifulSoup
import requests
import random
from datetime import datetime, timedelta
import pytz
import uuid
import weaviate
import json


def wait_for_weaviate():
    """Wait until Weaviate is available."""
   
    while True:
        try:
            # Try fetching the Weaviate metadata without initiating the client here
            response = requests.get("<http://weaviate:8080/v1/meta>")
            response.raise_for_status()
            meta = response.json()
           
            # If successful, the instance is up and running
            if meta:
                print("Weaviate is up and running!")
                return

        except (requests.exceptions.RequestException):
            # If there's any error (connection, timeout, etc.), wait and try again
            print("Waiting for Weaviate...")
            time.sleep(5)

RSS_URLS = [
    "<https://thedefiant.io/api/feed>",
    "<https://cointelegraph.com/rss>",
    "<https://cryptopotato.com/feed/>",
    "<https://cryptoslate.com/feed/>",
    "<https://cryptonews.com/news/feed/>",
    "<https://smartliquidity.info/feed/>",
    "<https://bitcoinmagazine.com/feed>",
    "<https://decrypt.co/feed>",
    "<https://bitcoinist.com/feed/>",
    "<https://cryptobriefing.com/feed>",
    "<https://www.newsbtc.com/feed/>",
    "<https://coinjournal.net/feed/>",
    "<https://ambcrypto.com/feed/>",
    "<https://www.the-blockchain.com/feed/>"
]

def get_article_body(link):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.3'}
        response = requests.get(link, headers=headers, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')
        paragraphs = soup.find_all('p')

        # Directly return list of non-empty paragraphs
        return [p.get_text().strip() for p in paragraphs if p.get_text().strip() != ""]

    except Exception as e:
        print(f"Error fetching article body for {link}. Reason: {e}")
        return []

def parse_date(date_str):
    # Current date format from the RSS
    date_format = "%a, %d %b %Y %H:%M:%S %z"
    try:
        dt = datetime.strptime(date_str, date_format)
        # Ensure the datetime is in UTC
        return dt.astimezone(pytz.utc)
    except ValueError:
        # Attempt to handle other possible formats
        date_format = "%a, %d %b %Y %H:%M:%S %Z"
        dt = datetime.strptime(date_str, date_format)
        return dt.replace(tzinfo=pytz.utc)

def fetch_rss(from_datetime=None):
    all_data = []
    all_entries = []
   
    # Step 1: Fetch all the entries from the RSS feeds and filter them by date.
    for url in RSS_URLS:
        print(f"Fetching {url}")
        feed = feedparser.parse(url)
        entries = feed.entries
        print('feed.entries', len(entries))

        for entry in feed.entries:
            entry_date = parse_date(entry.published)
           
            # Filter the entries based on the provided date
            if from_datetime and entry_date <= from_datetime:
                continue

            # Storing only necessary data to minimize memory usage
            all_entries.append({
                "Title": entry.title,
                "Link": entry.link,
                "Summary": entry.summary,
                "PublishedDate": entry.published
            })

    # Step 2: Shuffle the filtered entries.
    random.shuffle(all_entries)

    # Step 3: Extract the body for each entry and break it down by paragraphs.
    for entry in all_entries:
        article_body = get_article_body(entry["Link"])

        print("\\nTitle:", entry["Title"])
        print("Link:", entry["Link"])
        print("Summary:", entry["Summary"])
        print("Published Date:", entry["PublishedDate"])

        # Create separate records for each paragraph
        for paragraph in article_body:
            data = {
                "UUID": str(uuid.uuid4()), # UUID for each paragraph
                "Title": entry["Title"],
                "Link": entry["Link"],
                "Summary": entry["Summary"],
                "PublishedDate": entry["PublishedDate"],
                "Body": paragraph
            }
            all_data.append(data)

    print("-" * 50)

    df = pd.DataFrame(all_data)
    return df

def insert_data(df, batch_size=100):
    # Initialize the batch process
    with client.batch as batch:
        batch.batch_size = batch_size

        # Loop through and batch import the 'RSS_Entry' data
        for i, row in df.iterrows():
            if i % batch_size == 0:
                print(f"Importing entry: {i+1}")  # Status update

            properties = {
                "UUID": row["UUID"],
                "Title": row["Title"],
                "Link": row["Link"],
                "Summary": row["Summary"],
                "PublishedDate": row["PublishedDate"],
                "Body": row["Body"]
            }

            client.batch.add_data_object(properties, "RSS_Entry")

if __name__ == "__main__":

    # Wait until weaviate is available
    wait_for_weaviate()

    # Initialize the Weaviate client
    client = weaviate.Client("<http://weaviate:8080>")
    client.timeout_config = (3, 200)

    # Reset the schema
    client.schema.delete_all()

    # Define the "RSS_Entry" class
    class_obj = {
        "class": "RSS_Entry",
        "description": "An entry from an RSS feed",
        "properties": [
            {"dataType": ["text"], "description": "UUID of the entry", "name": "UUID"},
            {"dataType": ["text"], "description": "Title of the entry", "name": "Title"},
            {"dataType": ["text"], "description": "Link of the entry", "name": "Link"},
            {"dataType": ["text"], "description": "Summary of the entry", "name": "Summary"},
            {"dataType": ["text"], "description": "Published Date of the entry", "name": "PublishedDate"},
            {"dataType": ["text"], "description": "Body of the entry", "name": "Body"}
        ],
        "vectorizer": "text2vec-transformers"
    }

    # Add the schema
    client.schema.create_class(class_obj)

    # Retrieve the schema
    schema = client.schema.get()
    # Display the schema
    print(json.dumps(schema, indent=4))
    print("-"*50)

    # Current datetime
    now = datetime.now(pytz.utc)

    # Fetching articles from the last few days
    days_ago = 3
    print(f"Getting historical data for the last {days_ago} days.")
    last_week = now - timedelta(days=days_ago)
    df_hist =  fetch_rss(last_week)

    print("Head")
    print(df_hist.head().to_string())
    print("Tail")
    print(df_hist.head().to_string())
    print("-"*50)
    print("Total records fetched:",len(df_hist))
    print("-"*50)
    print("Inserting data")

    # insert historical data
    insert_data(df_hist,batch_size=100)

    print("-"*50)
    print("Data Inserted")

    # check if there is any relevant news in the last minute

    while True:
        # Current datetime
        now = datetime.now(pytz.utc)

        # Fetching articles from the last minute
        one_min_ago = now - timedelta(minutes=1)
        df =  fetch_rss(one_min_ago)
        print("Head")
        print(df.head().to_string())
        print("Tail")
        print(df.head().to_string())
       
        print("Inserting data")

        # insert minute data
        insert_data(df,batch_size=100)

        print("data inserted")

        print("-"*50)

        # Sleep for a minute
        time.sleep(60)

Streamlit apps rely on specific Python libraries and functions to operate smoothly. We'll explore the libraries used in our Streamlit app, such as streamlit, weaviate, and langchain, and discuss their roles in enabling real-time context retrieval.

Demonstrating Real-time Context Retrieval

As we bring together the various elements of our news retrieval system, it's time to experience the magic firsthand by using the Streamlit app to perform real-time context retrieval.

The Streamlit app's interface lets users input queries and initiate similarity searches, ensuring a user-friendly experience and allowing them to interact effortlessly with the underlying Weaviate and LangChain-powered functionality. The app acts as a bridge, making complex interactions accessible through a clean and intuitive interface.
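
A hedged sketch of how such an interface might wire the earlier query_db function into Streamlit is shown below. The module name app_backend, the widget labels, and the page title are illustrative assumptions, not the exact production code.

import streamlit as st

# query_db is the function defined earlier in this article; importing it from
# a module named app_backend is an assumption made for this sketch
from app_backend import query_db

st.title("Crypto News Context Retrieval")
st.caption("Similarity search over recent RSS articles, answered by an LLM")

query = st.text_input("What would you like an update on?")

if st.button("Get context") and query:
    with st.spinner("Searching Weaviate and building context..."):
        answer = query_db(query)
    # The answer already contains inline citations and a numbered source list
    st.markdown(answer)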

[Screenshot: the Streamlit app interface for real-time context retrieval]

The true power of our application shines when it provides context for user queries: LangChain dynamically builds context around the retrieved articles and responses, creating a comprehensive narrative that enhances user understanding.

Conclusion

In this second part of our series, we've embarked on the journey of creating an interactive and intuitive user interface using Streamlit. By weaving together the capabilities of Weaviate, LangChain, and Streamlit, we've established a powerful framework for context-based news retrieval. The Streamlit app showcases how the integration of these technologies can simplify complex processes, allowing users to effortlessly retrieve news articles and their contextual significance. As we wrap up our series, the next step is to dive into the provided code and experience the synergy of these technologies firsthand. Empower your applications with the ability to deliver context-rich and relevant information, bringing a new level of user experience to modern data-driven platforms.

Through these two articles, we've embarked on a journey to build an intelligent news retrieval system that leverages cutting-edge technologies. We've explored the foundations of Weaviate, delved into similarity search, harnessed LangChain for context building, and created a Streamlit application to provide users with a seamless experience. In the modern landscape of information retrieval, context is key, and the integration of these technologies empowers us to provide users with not just data, but understanding. As you venture forward, remember that these concepts are stepping stones. Embrace the code, experiment, and extend these ideas to create applications that offer tailored and relevant experiences to your users.

Author Bio

Alan Bernardo Palacio is a data scientist and an engineer with vast experience across different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst & Young and Globant, and now holds a data engineer position at Ebiquity Media, helping the company create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucumán in 2015, founded startups, and later earned a Master's degree from the Faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.
