Preprocessing data
Once the data is available, it usually needs to be cleaned or preprocessed before the actual natural language processing begins.
There are two major goals in preprocessing data. The first is to remove items that the system can't process, such as emojis, HTML markup, spelling errors, foreign words, or Unicode characters such as smart quotes. A number of existing Python libraries can help with this, and we'll show how to use them in the next section, Removing non-text. The second goal is addressed in the section called Regularizing text. We regularize text so that differences among words that are not relevant to the application's goal can be ignored; for example, in some applications, we might want to ignore the difference between uppercase and lowercase.
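To give a rough feel for both goals before we turn to the dedicated libraries, here is a minimal sketch using only Python's standard library (re and unicodedata). The function names clean_text and regularize_text, the emoji code point ranges, and the naive tag-stripping regex are illustrative assumptions rather than the approach used later in the chapter; the libraries covered in the next section handle these cases more robustly.

import re
import unicodedata

# Map smart quotes and dashes to plain ASCII equivalents.
SMART_PUNCT = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})

# Rough character class covering common emoji ranges (not exhaustive).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"   # symbols, pictographs, emoticons
    "\u2600-\u27BF"           # miscellaneous symbols and dingbats
    "]+"
)

HTML_TAG_PATTERN = re.compile(r"<[^>]+>")  # naive HTML tag stripper

def clean_text(text: str) -> str:
    """Goal 1: remove items the system can't process (tags, emojis, smart punctuation)."""
    text = HTML_TAG_PATTERN.sub(" ", text)
    text = EMOJI_PATTERN.sub("", text)
    text = text.translate(SMART_PUNCT)
    return text

def regularize_text(text: str) -> str:
    """Goal 2: normalize Unicode, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text

raw = "<p>I \u201cLOVE\u201d this movie \U0001F600</p>"
print(regularize_text(clean_text(raw)))   # i "love" this movie

Whether each of these steps is appropriate depends on the application: lowercasing, for instance, discards information that a named-entity or sentiment task might need.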
There are many possible preprocessing tasks that can be helpful in preparing natural language data....