Collecting data

Data collection is a critical first step in preparing a dataset for training LLMs. Let’s go through an example of preparing a dataset for an LLM-based Chrome extension that provides a Question & Answer (Q&A) interface for webpages that a user navigates to. This process involves gathering data from various sources, each of which can present unique challenges and opportunities for model training. Let’s review a few types of textual data that we’re likely to encounter.

Collecting structured data

Structured data is highly organized and easily readable by machines. It is typically stored in databases, spreadsheets, or CSV files. Each record adheres to a fixed schema with clearly defined columns and data types, making it straightforward to process and analyze. For instance, a file containing a list of Frequently Asked Questions (FAQs) and their answers from a company’s website can be used in training. Let’s go through that process here.

Imagine that we have a Parquet file (faqs.snappy.parquet) containing a list of FAQs and their answers, sourced from a company’s website. The structure of the data in this Parquet file could resemble the following pandas DataFrame table:

+--------------------------+-----------------------------+
| question                 | answer                      |
+--------------------------+-----------------------------+
| How can I reset my pa... | You can reset your passwo...|
+--------------------------+-----------------------------+
| What is the return po... | Our return policy is 30...  |
+--------------------------+-----------------------------+
| How should I track my... | You can track your orders...|
+--------------------------+-----------------------------+

We’ll need to do the following to collect the data:

  1. Load Parquet file: Use pandas and s3fs to read the Parquet file from the specified S3 bucket path.
  2. Convert to a list: Extract the question and answer columns from the DataFrame and convert them into lists.
  3. Create a new DataFrame: Form a new DataFrame from the list of questions and answers.
  4. Convert to CSV and upload: Convert the new DataFrame into CSV format using StringIO and upload it back to S3 as a CSV file.

This process transforms structured data from Parquet format into CSV format, making it more accessible for other applications and further processing. The example demonstrates a typical workflow for handling structured data in a cloud environment, leveraging popular Python libraries for data manipulation and AWS services for storage. Here’s the full code snippet to collect the structured data:

import pandas as pd
import s3fs
import boto3
from io import StringIO

# Path to the source Parquet file in S3
s3_bucket_path = 's3://your-bucket-name/path-to-your-file/faqs.snappy.parquet'

# Read the Parquet file directly from S3
fs = s3fs.S3FileSystem()
df = pd.read_parquet(s3_bucket_path, engine='pyarrow', filesystem=fs)

# Extract the question and answer columns as lists
questions = df['question'].tolist()
answers = df['answer'].tolist()

# Build a new DataFrame from the extracted lists
df_new = pd.DataFrame({'question': questions, 'answer': answers})

# Serialize the DataFrame to CSV in memory
csv_buffer = StringIO()
df_new.to_csv(csv_buffer, index=False)

# Upload the CSV back to S3
s3_resource = boto3.resource('s3')
bucket_name = 'your-bucket-name'  # Replace with your actual bucket name
object_key = 'path/to/your-new-file/questions_answers.csv'
s3_resource.Object(bucket_name, object_key).put(Body=csv_buffer.getvalue())
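
As a quick sanity check, you can read the uploaded CSV straight back from S3 with pandas. Here’s a minimal sketch, assuming the same bucket and key as above and that s3fs is installed:

import pandas as pd

# Hypothetical path matching the bucket and key used above
csv_path = 's3://your-bucket-name/path/to/your-new-file/questions_answers.csv'

# pandas delegates s3:// URLs to s3fs behind the scenes
df_check = pd.read_csv(csv_path)
print(df_check.head())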

Now that we have collected the structured data into a CSV file, let’s move on to semi-structured data.

Collecting semi-structured data

Semi-structured data doesn’t have a rigid structure like structured data, but it does have organizational properties that make it easier to analyze than unstructured data. JSON and XML are common formats, offering flexibility in data representation and a hierarchical structure for nesting information. As an example, JSON files exported from a customer service platform such as Jira, containing inquiry tickets with questions categorized by issue type, response texts, and metadata, can be used to help train our model. Let’s run through an example that assumes the Jira API returns the following elements:

  • ticket_id: A unique identifier for each ticket
  • issue_type: Categorization of the ticket’s subject matter
  • question: The user’s inquiry or problem statement
  • response: The provided solution or answer to the user’s question
  • metadata: A nested object containing the following fields:
      ◦ created_at: The timestamp when the ticket was created
      ◦ updated_at: The timestamp when the ticket was last updated
      ◦ tags: Keywords associated with the ticket for categorization or searchability

Let’s also assume that the results are returned in JSON format:

{
  "tickets": [
    {
      "ticket_id": "TICKET123",
      "issue_type": "Account Access",
      "question": "How can I recover my account?",
      "response": "To recover your account, please use the 'Forgot Password' option on the login page or contact support for further assistance.",
      "metadata": {
        "created_at": "2023-01-15T09:30:00Z",
        "updated_at": "2023-01-16T10:00:00Z",
        "tags": ["account", "recovery", "support"]
      }
    },
    {
      "ticket_id": "TICKET456",
      "issue_type": "Payment Issue",
      "question": "Why was my payment declined?",
      "response": "Payments can be declined for several reasons, including insufficient funds, incorrect card details, or bank restrictions. Please review your payment method or contact your bank for more information.",
      "metadata": {
        "created_at": "2023-01-20T11:20:00Z",
        "updated_at": "2023-01-20T12:35:00Z",
        "tags": ["payment", "declined", "billing"]
      }
    }
  ]
}

Let’s now write the end-to-end code that fetches these data elements from Jira and transforms them into a Q&A CSV file to be uploaded into an S3 bucket:

import requests
import pandas as pd
import boto3
from io import StringIO

# API endpoint and S3 destination (replace the placeholders with your own values)
API_URL = "https://example.com/api/tickets"
S3_BUCKET = "your-bucket-name"
S3_KEY = "path/to/your-file/questions_answers.csv"
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'

def save_to_s3(df, bucket, key):
    # Serialize the DataFrame to CSV in memory and upload it to S3
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)
    s3_client = boto3.client('s3',
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
    s3_client.put_object(Bucket=bucket, Key=key,
        Body=csv_buffer.getvalue())
    print(f"Successfully uploaded {key} to {bucket}")

The preceding code imports the required packages, defines constants, and adds a function that saves a DataFrame to S3 as a CSV file. With that in place, let’s get into the processing flow that follows:

def fetch_data(api_url):
    # Retrieve the tickets from the API as JSON
    response = requests.get(api_url)
    if response.status_code == 200:
        return response.json()
    else:
        print("Failed to fetch data")
        return None

def process_data(json_data):
    # Keep only the question and answer fields from each ticket
    tickets = json_data.get("tickets", [])
    data = [{"question": ticket["question"],
        "answer": ticket["response"]} for ticket in tickets]
    return pd.DataFrame(data)

# Fetch the tickets, extract Q&A pairs, and upload them to S3
json_data = fetch_data(API_URL)
if json_data:
    df = process_data(json_data)
    save_to_s3(df, S3_BUCKET, S3_KEY)

This code defines the remaining two functions, fetch_data and process_data, which, together with save_to_s3, fetch the semi-structured data from the API, extract the question-and-answer pairs, and upload them to S3 as a CSV file.
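
If you also want to retain the nested metadata (for example, to filter tickets by tag or creation date later), pandas can flatten the hierarchy into columns. The following is a minimal sketch, assuming the same tickets payload shown earlier:

# Flatten nested fields such as metadata.created_at into their own columns
tickets = json_data.get("tickets", [])
df_full = pd.json_normalize(tickets, sep='_')
# Columns now include ticket_id, issue_type, question, response,
# metadata_created_at, metadata_updated_at, and metadata_tags
print(df_full.columns.tolist())

Now that we’ve shown how to collect semi-structured data, let’s look at collecting unstructured data.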

Collecting unstructured data

Unstructured data is not organized in a predefined manner, making it the most challenging to process and analyze. This category includes text files, documents, emails, and web pages, where the information is presented as free-form text. We’ll want to incorporate web pages and their FAQs into our data so that the LLM learns to answer questions about a webpage. Here’s an example of what this unstructured data, a webpage, looks like:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>FAQ Page</title>
</head>
<body>
    <div class="faq-section">
        <h2>How can I reset my password?</h2>
        <p>You can reset your password by going to the settings page and selecting the 'Reset Password' option.</p>
        <h2>What is the return policy?</h2>
        <p>Our return policy lasts 30 days. If 30 days have gone by since your purchase, unfortunately, we can't offer you a refund or exchange.</p>
        <h2>How do I track my order?</h2>
        <p>Once your order has been shipped, you will receive a tracking number that allows you to follow your package's journey to your doorstep.</p>
    </div>
</body>
</html>

Let’s look at a code snippet that can process this unstructured text, HTML, into a CSV file for easier data analysis and model fine-tuning:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import boto3
from io import StringIO

def save_df_to_s3(
    df, bucket_name, object_name, aws_access_key_id,
    aws_secret_access_key
):
    # Serialize the DataFrame to CSV in memory and upload it to S3
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)
    s3_resource = boto3.resource('s3',
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key)
    s3_resource.Object(
        bucket_name,
        object_name
    ).put(Body=csv_buffer.getvalue())
    print(f"File {object_name} saved to bucket {bucket_name}.")

The preceding code imports the required libraries and implements a function that saves a DataFrame to S3. Let’s look at the rest of the code now:

def fetch_faq_page(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        faq_sections = soup.find_all(class_='faq-section')
        faqs = []
        for section in faq_sections:
            questions = section.find_all('h2')  # Assuming questions are wrapped in <h2> tags
            # Each answer is assumed to be the <p> that immediately follows its question
            answers = [q.find_next_sibling('p') for q in questions]
            faqs += [{'question': q.text.strip(),
                'answer': a.text.strip()
            } for q, a in zip(questions, answers) if a]
        return faqs
    else:
        print("Failed to fetch the page.")
        return []

The preceding function scrapes the FAQ page and extracts the question-and-answer pairs into a structured format. Let’s now look at the main function of the code:

def main():
    # Replace the placeholders with your own URL, credentials, and bucket details
    FAQ_URL = "https://example.com/faqs"
    AWS_ACCESS_KEY_ID = 'your_access_key'
    AWS_SECRET_ACCESS_KEY = 'your_secret_key'
    S3_BUCKET = 'your_bucket_name'
    S3_OBJECT_NAME = 'faqs.csv'
    faq_data = fetch_faq_page(FAQ_URL)
    if faq_data:
        df_faqs = pd.DataFrame(faq_data)
        save_df_to_s3(df_faqs, S3_BUCKET, S3_OBJECT_NAME,
            AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    else:
        print("No data fetched or saved.")

if __name__ == "__main__":
    main()

The preceding code allows us to scrape FAQs from a website, assuming that the site’s robots.txt file allows it, transform them into a CSV file, and upload it to S3.
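
To check this programmatically before scraping, you can consult the site’s robots.txt with Python’s built-in urllib.robotparser. Here’s a minimal sketch, using the hypothetical URL from the example above:

from urllib import robotparser

# Hypothetical FAQ URL from the example above
FAQ_URL = "https://example.com/faqs"

# Load and parse the site's robots.txt rules
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Only scrape if the rules allow our user agent to fetch this page
if rp.can_fetch("*", FAQ_URL):
    faq_data = fetch_faq_page(FAQ_URL)
else:
    print("Scraping this page is disallowed by robots.txt.")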

Now, let’s get this collected CSV data into a single, unified dataset to facilitate data exploration and model training.
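
As a rough preview of that step, the three CSV files uploaded in this section could be read back from S3 and stacked with pandas. Here’s a minimal sketch, assuming the bucket and object keys used above and that s3fs is installed:

import pandas as pd

# Hypothetical S3 paths produced earlier in this section
csv_paths = [
    's3://your-bucket-name/path/to/your-new-file/questions_answers.csv',  # structured
    's3://your-bucket-name/path/to/your-file/questions_answers.csv',      # semi-structured
    's3://your_bucket_name/faqs.csv',                                     # unstructured
]

# Read each CSV and concatenate into a single question-answer dataset
unified_df = pd.concat(
    (pd.read_csv(path) for path in csv_paths), ignore_index=True
)
unified_df = unified_df.drop_duplicates().reset_index(drop=True)
print(unified_df.shape)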
