Summary
In this chapter, we covered how to find and use natural language data, both by collecting data for a specific application and by using generally available corpora.
We discussed a wide variety of techniques for preparing data for NLP, including annotation, which provides the foundation for supervised learning. We also discussed common preprocessing steps that remove noise and reduce variation in the data, allowing machine learning algorithms to focus on the most informative differences among categories of texts. Another important set of topics covered in this chapter concerned privacy and ethics: how to protect the privacy of information included in text data, and how to ensure that crowdsourcing workers who generate or annotate data are treated fairly.
The next chapter will discuss exploratory techniques for getting an overall picture of a dataset, such as summary statistics (word frequencies, category frequencies, and so...