Natural Language Data – Finding and Preparing Data
This chapter will teach you how to identify and prepare data for processing with natural language understanding techniques. It will discuss data from databases, the web, and different kinds of documents, along with the privacy and ethics considerations that apply when collecting and using that data. It will also briefly cover the Wizard of Oz technique, in which a human stands in for the system behind the scenes so that realistic user data can be collected. If you don’t have access to your own data, or if you wish to compare your results to those of other researchers, this chapter will also describe generally available and frequently used corpora. It will then go on to discuss preprocessing steps such as stemming and lemmatization.
This chapter will cover the following topics:
- Sources of data and annotation
- Ensuring privacy and observing ethical considerations
- Generally available corpora
- Preprocessing data
- Application-specific types of preprocessing
- Choosing among preprocessing techniques