Gathering Data – Content is King
This book rests on an assumption: enterprise ChatGPT solutions are needed in almost all cases because a company has something unique to offer its customers and possesses an exceptional understanding of its own products, services, and content. That content is private or proprietary, and thus not part of large language models (LLMs) built by scraping the internet. These models are trained by crawling billions of pages of web content; Common Crawl (commoncrawl.org), a third-party repository of 2+ billion pages, is commonly cited as a primary source of training material for major models such as GPT-3 and Llama. From these massive collections of text, a model learns the statistical relationships among words and concepts, which it then uses to predict and respond to questions. Training a model can take months; most have billions of parameters learned from billions of words. When customers come to the enterprise for answers, the models must therefore incorporate enterprise content that is not part of this crawl to make them unique...