Collecting data
Data collection is a critical first step in preparing a dataset for training LLMs. Let’s go through an example of preparing a dataset for an LLM-based Chrome extension that provides a Question & Answer (Q&A) interface for webpages that a user navigates to. This process involves gathering data from various sources, each of which can present unique challenges and opportunities for model training. Let’s review a few types of textual data that we’re likely to encounter.
Collecting structured data
Structured data is highly organized and easily readable by machines. It is typically stored in databases, spreadsheets, or CSV files. Each record adheres to a fixed schema with clearly defined columns and data types, making it straightforward to process and analyze. For instance, a file containing a list of Frequently Asked Questions (FAQs) and their answers from a company’s website can be used in training. Let’s go through that process here.
Imagine that we have a Parquet file (faqs.snappy.parquet) containing a list of FAQs and their answers, sourced from a company’s website. The structure of the data in this Parquet file could resemble the following pandas DataFrame table:
+--------------------------+------------------------------+
| question                 | answer                       |
+--------------------------+------------------------------+
| How can I reset my pa... | You can reset your passwo... |
| What is the return po... | Our return policy is 30...   |
| How should I track my... | You can track your orders... |
+--------------------------+------------------------------+
We’ll need to do the following to collect the data:
- Load Parquet file: Use pandas and s3fs to read the Parquet file from the specified S3 bucket path.
- Convert to a list: Extract the question and answer columns from the DataFrame and convert them into lists.
- Create a new DataFrame: Form a new DataFrame from the list of questions and answers.
- Convert to CSV and upload: The new DataFrame is converted into the CSV format using StringIO and then uploaded back to S3 as a CSV file.
This process transforms structured data from the Parquet format to the CSV format, making it easier to consume in the data exploration and fine-tuning steps that follow. It also demonstrates a typical workflow for handling structured data in a cloud environment, leveraging popular Python libraries for data manipulation and AWS services for storage. Here’s the full code snippet to collect the structured data:
import pandas as pd
import s3fs
import boto3
from io import StringIO

# Path to the Parquet file in S3
s3_bucket_path = 's3://your-bucket-name/path-to-your-file/faqs.snappy.parquet'

# Read the Parquet file from S3
fs = s3fs.S3FileSystem()
df = pd.read_parquet(s3_bucket_path, engine='pyarrow', filesystem=fs)

# Extract the question and answer columns as lists
questions = df['question'].tolist()
answers = df['answer'].tolist()

# Create a new DataFrame from the extracted lists
df_new = pd.DataFrame({'question': questions, 'answer': answers})

# Convert the new DataFrame to CSV in memory
csv_buffer = StringIO()
df_new.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)

# Upload the CSV back to S3
s3_resource = boto3.resource('s3')
bucket_name = 'your-bucket-name'  # Replace with your actual bucket name
object_key = 'path/to/your-new-file/questions_answers.csv'
s3_resource.Object(bucket_name, object_key).put(Body=csv_buffer.getvalue())
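As a quick sanity check, we can read the CSV back from S3 and confirm that the upload worked. Here’s a minimal sketch, assuming the same placeholder bucket and key used above and that s3fs is installed so that pandas can read directly from S3:

import pandas as pd

# Hypothetical path matching the placeholder bucket and key used in the upload step
csv_path = 's3://your-bucket-name/path/to/your-new-file/questions_answers.csv'

# pandas reads directly from S3 when s3fs is installed
df_check = pd.read_csv(csv_path)
print(df_check.head())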
Now that we have collected the structured data into a CSV file, let’s move on to semi-structured data.
Collecting semi-structured data
Semi-structured data doesn’t have a rigid structure like structured data, but it does have organizational properties that make it easier to analyze than unstructured data. JSON and XML are common formats, offering flexibility in data representation and a hierarchical structure for nesting information. As an example, JSON files exported from a customer service platform such as Jira, containing inquiry tickets with questions categorized by issue type, response texts, and metadata, can be used to help train our model. Let’s run through an example that assumes the Jira API returns the following elements:
- ticket_id: A unique identifier for each ticket
- issue_type: Categorization of the ticket’s subject matter
- question: The user’s inquiry or problem statement
- response: The provided solution or answer to the user’s question
- metadata: The created and updated dates and tags:
  - created_at: The timestamp when the ticket was created
  - updated_at: The timestamp when the ticket was last updated
  - tags: Keywords associated with the ticket for categorization or searchability
Let’s also assume that the results are returned in JSON format:
{ "tickets": [ { "ticket_id": "TICKET123", "issue_type": "Account Access", "question": "How can I recover my account?", "response": "To recover your account, please use the 'Forgot Password' option on the login page or contact support for further assistance.", "metadata": { "created_at": "2023-01-15T09:30:00Z", "updated_at": "2023-01-16T10:00:00Z", "tags": ["account", "recovery", "support"] } }, { "ticket_id": "TICKET456", "issue_type": "Payment Issue", "question": "Why was my payment declined?", "response": "Payments can be declined for several reasons, including insufficient funds, incorrect card details, or bank restrictions. Please review your payment method or contact your bank for more information.", "metadata": { "created_at": "2023-01-20T11:20:00Z", "updated_at": "2023-01-20T12:35:00Z", "tags": ["payment", "declined", "billing"] } }, ] }
Let’s now write the end-to-end code that fetches these data elements from Jira, transforms them into a Q&A CSV, and uploads the result to an S3 bucket:
import requests
import pandas as pd
import boto3
from io import StringIO

API_URL = "https://example.com/api/tickets"
S3_BUCKET = "your-bucket-name"
S3_KEY = "path/to/your-file/questions_answers.csv"
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'

def save_to_s3(df, bucket, key):
    # Serialize the DataFrame to CSV in memory
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)

    # Upload the CSV to S3
    s3_client = boto3.client(
        's3',
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY
    )
    s3_client.put_object(Bucket=bucket, Key=key, Body=csv_buffer.getvalue())
    print(f"Successfully uploaded {key} to {bucket}")
The preceding code imports the required packages, defines constants, and implements a function that saves a DataFrame to S3 as a CSV file. With that in place, we can move on to the processing flow that follows:
def fetch_data(api_url):
    # Fetch the ticket data from the API
    response = requests.get(api_url)
    if response.status_code == 200:
        return response.json()
    else:
        print("Failed to fetch data")
        return None

def process_data(json_data):
    # Extract question/response pairs from the tickets
    tickets = json_data.get("tickets", [])
    data = [{"question": ticket["question"], "answer": ticket["response"]}
            for ticket in tickets]
    return pd.DataFrame(data)

json_data = fetch_data(API_URL)
if json_data:
    df = process_data(json_data)
    save_to_s3(df, S3_BUCKET, S3_KEY)
This code defines two more functions that, together with save_to_s3, fetch the semi-structured data from the Jira API, extract the question-and-answer pairs, and write them to a CSV file in S3.
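Before wiring this up to a live Jira instance, it can help to verify the extraction logic locally. The following is a minimal sketch that runs process_data on the first ticket from the sample payload shown earlier; sample_json is simply that example pasted into Python:

# Hypothetical local test of process_data using the sample payload above
sample_json = {
    "tickets": [
        {
            "ticket_id": "TICKET123",
            "issue_type": "Account Access",
            "question": "How can I recover my account?",
            "response": "To recover your account, please use the 'Forgot Password' "
                        "option on the login page or contact support for further assistance.",
            "metadata": {
                "created_at": "2023-01-15T09:30:00Z",
                "updated_at": "2023-01-16T10:00:00Z",
                "tags": ["account", "recovery", "support"]
            }
        }
    ]
}

df_sample = process_data(sample_json)
print(df_sample)  # One row with a question column and an answer column

With the extraction logic verified, let’s look at collecting unstructured data.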
Collecting unstructured data
Unstructured data is not organized in a predefined manner, making it the most challenging to process and analyze. This category includes text files, documents, emails, and web pages, where the information is presented as free-form text. We’ll want to incorporate web pages and their FAQs into our data so that the LLM learns to answer questions about the pages a user visits. Here’s an example of what the unstructured data, a webpage, looks like:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>FAQ Page</title>
</head>
<body>
    <div class="faq-section">
        <h2>How can I reset my password?</h2>
        <p>You can reset your password by going to the settings page and selecting the 'Reset Password' option.</p>
        <h2>What is the return policy?</h2>
        <p>Our return policy lasts 30 days. If 30 days have gone by since your purchase, unfortunately, we can't offer you a refund or exchange.</p>
        <h2>How do I track my order?</h2>
        <p>Once your order has been shipped, you will receive a tracking number that allows you to follow your package's journey to your doorstep.</p>
    </div>
</body>
</html>
Let’s look at a code snippet that processes this unstructured HTML into a CSV file for easier data analysis and model fine-tuning:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import boto3
from io import StringIO

def save_df_to_s3(
    df, bucket_name, object_name, aws_access_key_id, aws_secret_access_key
):
    # Serialize the DataFrame to CSV in memory
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)

    # Upload the CSV to S3
    s3_resource = boto3.resource(
        's3',
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key
    )
    s3_resource.Object(
        bucket_name, object_name
    ).put(Body=csv_buffer.getvalue())
    print(f"File {object_name} saved to bucket {bucket_name}.")
The preceding code imports the required libraries and implements a function that saves a DataFrame to S3. Let’s look at the rest of the code now:
def fetch_faq_page(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        faq_sections = soup.find_all(class_='faq-section')
        faqs = []
        for section in faq_sections:
            # Assuming questions are wrapped in <h2> tags
            questions = section.find_all('h2')
            answers = [q.find_next_sibling('p') for q in questions]
            faqs += [{'question': q.text.strip(),
                      'answer': a.text.strip()}
                     for q, a in zip(questions, answers) if a]
        return faqs
    else:
        print("Failed to fetch the page.")
        return []
The preceding function scrapes the FAQ page and extracts each question and its answer into a structured list of dictionaries. Let’s now look at the main function of the code:
def main():
    FAQ_URL = "https://example.com/faqs"
    AWS_ACCESS_KEY_ID = 'your_access_key'
    AWS_SECRET_ACCESS_KEY = 'your_secret_key'
    S3_BUCKET = 'your_bucket_name'
    S3_OBJECT_NAME = 'faqs.csv'

    faq_data = fetch_faq_page(FAQ_URL)
    if faq_data:
        df_faqs = pd.DataFrame(faq_data)
        save_df_to_s3(df_faqs, S3_BUCKET, S3_OBJECT_NAME,
                      AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    else:
        print("No data fetched or saved.")
The preceding code lets us scrape FAQs from a website, assuming that the site’s robots.txt file allows it, transform them into a CSV file, and upload the result to S3.
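If you want to check this programmatically, Python’s standard library includes urllib.robotparser. Here’s a minimal sketch, using the placeholder FAQ URL from main(), that tests whether a site’s robots.txt permits fetching a given page before we scrape it:

from urllib import robotparser
from urllib.parse import urlparse

def allowed_to_scrape(url, user_agent="*"):
    # Locate the site's robots.txt and check whether fetching this URL is permitted
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

# Hypothetical usage with the example FAQ URL
if allowed_to_scrape("https://example.com/faqs"):
    print("Scraping permitted by robots.txt")
else:
    print("Scraping disallowed; skip this site")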
Now, let’s get this collected CSV data into a single, unified dataset to facilitate data exploration and model training.
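As a preview of what that unification might look like, here’s a minimal sketch that concatenates the three CSV files produced above into a single DataFrame; the S3 paths are the placeholder values from the earlier snippets and require s3fs to read:

import pandas as pd

# Hypothetical placeholder paths to the three CSV files collected above
csv_paths = [
    's3://your-bucket-name/path/to/your-new-file/questions_answers.csv',  # structured
    's3://your-bucket-name/path/to/your-file/questions_answers.csv',      # semi-structured
    's3://your_bucket_name/faqs.csv',                                     # unstructured
]

# Concatenate all question/answer pairs into one unified DataFrame
unified_df = pd.concat([pd.read_csv(path) for path in csv_paths], ignore_index=True)
print(unified_df.shape)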