The next step in preparing for text analysis is some preliminary cleaning. This is a common starting point, regardless of which machine learning method will be applied later. When working with text, there are several terms and patterns that provide no meaningful information. Some of these are generally not useful, and the steps that remove them can be applied every time, while others are more context-dependent; a small sketch of the context-dependent case follows below.
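As an illustration of the context-dependent case, suppose the corpus contains source-specific boilerplate tokens that we also want to drop. The sketch below is only an example: it assumes our tokenized data frame word_tokens with its word column, and a hypothetical character vector custom_terms whose contents we choose ourselves rather than take from the data.
# Hypothetical, context-dependent terms we do not want in the analysis
custom_terms <- c("rt", "amp", "http")
# Drop any token that matches one of the custom terms
word_tokens <- word_tokens %>%
  filter(!word %in% custom_terms)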
As previously noted, there are collections of terms referred to as stop words. These terms carry little informational value and can usually be removed. To remove stop words from our data, we use the following code:
# Keep only the tokens that do not appear in the stop_words lexicon
word_tokens <- word_tokens %>%
  filter(!word %in% stop_words$word)
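Equivalently, the tidytext workflow often expresses this filter with dplyr's anti_join(). The sketch below assumes the stop_words data frame from the tidytext package is available, as in the code above, and produces the same result.
# Same result as the filter() above: keep only rows whose word
# is not present in the stop_words lexicon
word_tokens <- word_tokens %>%
  anti_join(stop_words, by = "word")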
After removing stop words, our row count drops from 3.5 million to 1.7 million. In effect, our data (word_tokens...