Finding sources of data and annotating it
Data is where all natural language processing (NLP) projects start. Data can take the form of written text or transcribed speech. Its purpose is to teach an NLP system what to do when it is given similar data in the future. Specific collections of data are also called corpora or datasets, and we will often use these terms interchangeably. Recently, large pretrained models have been developed that greatly reduce the need for data in many applications. However, these pretrained models, which will be discussed in detail in Chapter 11, in most cases do not eliminate the need for application-specific data.
Written language data can be of any length, ranging from very short texts such as tweets to multi-page documents or even books. It can be interactive, such as the record of a chatbot session between a user and a system, or non-interactive, such as a newspaper article or blog post. Similarly, spoken...