Removing special characters and punctuation
Removing special characters and punctuation is a common step in text preprocessing. For many tasks, special characters and punctuation marks add little meaning to the text, and leaving them in can add noise for machine learning models. For example, when text is split on whitespace, "word" and "word," end up as two different tokens. One way to perform this task is by using regular expressions, such as the following:
import re
re.sub(r"[^a-zA-Z0-9]+", "", string)
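This removes every character that is not an ASCII letter or digit from our input string. Note that this includes whitespace, so the words of a sentence are joined together; to keep the spaces between words, add \s to the character class, as in [^a-zA-Z0-9\s]. Sometimes, there may be special characters that we would want to replace with whitespace instead of deleting them. Take a look at the following examples: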
- president-elect
- body-type
In these two examples, we would want to replace the “-” with whitespace, as follows:
- president elect
- body type
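The two behaviors described above can be combined in a short sketch. The function name clean_text and the sample sentence are made up for illustration, and the choice to treat only the hyphen as a word separator is an assumption; other characters, such as slashes or underscores, may need the same handling depending on the data.

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical helper: strip punctuation while keeping word boundaries."""
    # Replace hyphens with a space so "president-elect" stays two words.
    text = re.sub(r"-", " ", text)
    # Drop everything that is not an ASCII letter, digit, or whitespace.
    text = re.sub(r"[^a-zA-Z0-9\s]+", "", text)
    # Collapse runs of whitespace left behind by the previous steps.
    return re.sub(r"\s+", " ", text).strip()


# Deleting the hyphen outright glues the two words together.
print(re.sub(r"[^a-zA-Z0-9]+", "", "president-elect"))  # presidentelect

# Replacing it with a space first keeps them separate.
print(clean_text("president-elect"))  # president elect
print(clean_text("The president-elect spoke; body-type matters!"))
# The president elect spoke body type matters
```

Doing the hyphen replacement before the general removal matters: once the hyphen has been deleted, there is nothing left to mark where the two words met.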
Next, we’ll cover stop word removal.
Stop word removal
Stop words are words that do not contribute much to the meaning of a sentence or piece of text, and therefore can...