Collecting data
Data collection is the first step in preparing a dataset for training LLMs. As a running example, let’s prepare a dataset for an LLM-based Chrome extension that provides a question-and-answer (Q&A) interface for the webpages a user visits. The data comes from several kinds of sources, and each kind brings its own challenges and opportunities for model training. Let’s review a few types of textual data that we’re likely to encounter.
Collecting structured data
Structured data is highly organized and easily readable by machines. It is typically stored in databases, spreadsheets, or CSV files. Each record adheres to a fixed schema with clearly defined columns and data types, making it straightforward to process and analyze. For instance, a file containing a list of Frequently Asked Questions (FAQs) and their answers from a company’s website can be used in training. Let’s go through that...
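To make the fixed-schema idea concrete, here is a minimal sketch of reading an FAQ file into question-and-answer records. It is an illustration under stated assumptions, not a prescribed pipeline: the column names `question` and `answer`, the sample rows, and the helper `load_faq_pairs` are all hypothetical, and the data is held in an in-memory string so the example runs on its own.

```python
import csv
import io

# Hypothetical sample data standing in for a company's FAQ export.
# A real file would be opened with open(path, newline="", encoding="utf-8").
SAMPLE_FAQ_CSV = '''question,answer
What are your support hours?,We are available 9am to 5pm on weekdays.
How do I reset my password?,"Click ""Forgot password"" on the login page."
Do you offer refunds?,
'''


def load_faq_pairs(csv_file):
    """Read question/answer rows and return cleaned records.

    Assumes the header row names the columns "question" and "answer".
    Rows with a missing question or answer are skipped.
    """
    reader = csv.DictReader(csv_file)
    pairs = []
    for row in reader:
        question = (row.get("question") or "").strip()
        answer = (row.get("answer") or "").strip()
        if question and answer:
            pairs.append({"question": question, "answer": answer})
    return pairs


pairs = load_faq_pairs(io.StringIO(SAMPLE_FAQ_CSV))
for pair in pairs:
    print(pair)
```

Because every row follows the same schema, the only cleaning needed here is trimming whitespace and dropping incomplete rows, such as the last sample row, which has no answer.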