Text preprocessing
Textual data comes in a variety of formats and styles. It may be structured and readable, or raw and unstructured. It may contain punctuation and symbols that we do not want in our models, or HTML and other non-textual formatting, which is a particular concern when scraping text from online sources. To prepare text for input into an NLP model, we must preprocess it, cleaning the data so that it is in a standard format. In this section, we will illustrate some of these preprocessing steps in more detail.
Removing HTML
When scraping text from online sources, you may find that your text contains HTML markup and other non-textual artifacts. We generally do not want to include these in the inputs to our NLP models, so they should be removed by default. For example, in HTML, the <b> tag indicates that the text following it should be in bold font. However...
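As a rough illustration of removing markup such as the <b> tag, here is a minimal sketch that uses only Python's standard-library html.parser module. The names TagStripper and strip_html are hypothetical helpers introduced for this example, and the approach is an assumption for illustration, not necessarily the method used elsewhere in this text.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of `markup` with all tags removed."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_html("This is <b>bold</b> text &amp; more."))
```

Note that this sketch also keeps the contents of script and style elements, since the parser reports them as ordinary text. For real scraped pages, a dedicated HTML parsing library is likely a more robust choice.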