Data collection and preparation
Data collection and preparation underpin both the quality and the efficiency of large language model (LLM) training. This phase involves gathering, processing, and storing data in a form that is well suited to training.
Data collection
Data collection for LLM training typically draws on a variety of public sources that are rich in language diversity. These sources include the following:
- Web text: Data scraped from websites, encompassing a wide range of topics and styles
- Books and publications: Texts from books, especially those in the public domain, provide a classic and varied literary perspective
- Social media feeds: Platforms such as Twitter or Reddit offer insights into colloquial and current language usage
- News articles: Datasets from news websites present formal and contemporary language
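One way to keep track of these source categories during collection is to tag every gathered document with its origin and then check how the corpus is balanced. The following is a minimal sketch under stated assumptions: the `RawDocument` record, the `SOURCE_TYPES` labels, and the `source_mix` helper are hypothetical names introduced for illustration, and the sample records are placeholders rather than real data.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical labels for the four source categories listed above.
SOURCE_TYPES = {"web", "book", "social", "news"}


@dataclass(frozen=True)
class RawDocument:
    source_type: str  # one of SOURCE_TYPES
    origin: str       # hypothetical identifier, such as a URL or file name
    text: str

    def __post_init__(self):
        # Reject records whose category is not one of the known labels.
        if self.source_type not in SOURCE_TYPES:
            raise ValueError(f"unknown source type: {self.source_type!r}")


def source_mix(docs):
    """Return each source type's share of the total character count."""
    chars = Counter()
    for doc in docs:
        chars[doc.source_type] += len(doc.text)
    total = sum(chars.values())
    if total == 0:
        return {}
    return {src: count / total for src, count in chars.items()}


# Placeholder records for illustration only, not real collected data.
docs = [
    RawDocument("web", "example.com/page", "a" * 60),
    RawDocument("news", "example.org/story", "b" * 30),
    RawDocument("social", "post-1", "c" * 10),
]
mix = source_mix(docs)
for source, share in sorted(mix.items()):
    print(f"{source}: {share:.0%}")
```

Measuring the mix by character count is only one possible choice; token counts or document counts would work in the same way and may be more appropriate depending on how the training data is later sampled.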
Here’s an example of what a web scraper may obtain from a news site in JSON form:
{ "...