Removing stopwords
When we work with words, especially when we care about their meaning, we sometimes need to exclude very frequent words that add little substantive meaning to a sentence, such as but, can, we, and so on. This recipe shows how to do that.
Getting ready…
For this recipe, we will need a list of stopwords. We provide a list in the book's GitHub repository. You might find that for your project, you need to customize the list and add or remove words as necessary.
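As a rough illustration of what customizing the list can look like, the sketch below treats the stopwords as a Python set and adds or removes entries with set operations. The word lists and the names base_stopwords, extra_stopwords, words_to_keep, and remove_stopwords are hypothetical and chosen only for this example; they are not the contents of the list in the book's repository.

```python
# A tiny, made-up stopword set standing in for a real list.
base_stopwords = {"but", "can", "we", "the", "a"}

# Hypothetical project-specific adjustments.
extra_stopwords = {"holmes"}   # words to start treating as stopwords
words_to_keep = {"can"}        # words we no longer want to filter out

# Add the extra words, then drop the ones we want to keep in the text.
custom_stopwords = (base_stopwords | extra_stopwords) - words_to_keep


def remove_stopwords(words, stopwords):
    """Return the words whose lowercase form is not in the stopword set."""
    return [word for word in words if word.lower() not in stopwords]
```

Using a set rather than a list keeps membership checks fast, which matters when the text is long.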
You can also use the stopwords list provided with the nltk package.
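One possible way to load that list is sketched here. It assumes nltk is installed and that the stopwords corpus may need to be downloaded the first time it is used; the helper name load_nltk_stopwords is hypothetical and not part of nltk itself.

```python
def load_nltk_stopwords(language="english"):
    """Return NLTK's stopword list for the given language as a set.

    The imports are kept inside the function so that nltk is only
    required when the helper is actually called.
    """
    import nltk
    from nltk.corpus import stopwords

    try:
        words = stopwords.words(language)
    except LookupError:
        # The corpus is not available locally yet; fetch it once and retry.
        nltk.download("stopwords")
        words = stopwords.words(language)
    return set(words)
```

Calling load_nltk_stopwords() should give a set of lowercase English stopwords that can be used in place of, or combined with, a custom list.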
We will be using the Sherlock Holmes text referred to earlier. For this recipe, we will need just the beginning of the book, which can be found in the sherlock_holmes_1.txt file.
How to do it…
In this recipe, we will read in the text file and the stopwords file, tokenize the text, and remove the stopwords from the resulting list of words:
- Import the csv and nltk modules:
import csv
import...