Text preprocessing
Before we analyze text, it's often helpful to apply some common cleaning and preprocessing steps.
These often include:
- Lowercasing
- Removing punctuation, extra whitespace, and numbers
- Removing other specific text patterns (for example, emails)
- Removing stop words
- Stemming or lemmatization
Cleaning and preparing text can improve the performance of ML algorithms and make the results of analysis easier to understand. We'll cover the steps listed above in order.
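To give a feel for how these steps fit together, here is a minimal sketch using only Python's standard library. The function name basic_clean, the tiny STOP_WORDS set, the simplistic email pattern, and the sample sentence are all assumptions made for illustration. Stemming and lemmatization are left out because they usually rely on a third-party library.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list for illustration only;
# real projects usually take a fuller list from a library.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "at"}

# Simplistic email matcher (an assumption, not a full email validator).
EMAIL_PATTERN = re.compile(r"\S+@\S+\.\S+")


def basic_clean(text):
    """Apply the listed cleaning steps, except stemming/lemmatization."""
    # 1. Lowercase everything.
    text = text.lower()
    # 2. Remove emails first, because stripping punctuation would
    #    destroy the "@" and "." that the pattern relies on.
    text = EMAIL_PATTERN.sub(" ", text)
    # 3. Remove punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 4. Remove numbers.
    text = re.sub(r"\d+", " ", text)
    # 5. split() with no arguments also collapses runs of whitespace.
    words = text.split()
    # 6. Drop stop words.
    return [word for word in words if word not in STOP_WORDS]


sample = "Email Pierre at pierre@example.com: the war began in 1805!"
print(basic_clean(sample))  # ['email', 'pierre', 'war', 'began']
```

The order of the steps matters here: pattern-based removal (such as emails) generally needs to run before punctuation is stripped, or the pattern will no longer match.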
Basic text cleaning
First, lowercasing is quite easy in Python. We simply take a string variable and use the built-in .lower() method. We'll use the book War and Peace by Leo Tolstoy for our text since it's one of the most famous long books. Perhaps we can draw some conclusions about the topics of the book without reading it. The Project Gutenberg website (https://www.gutenberg.org/) will...