Filtering out stopwords, names, and numbers
Stopwords are common words that carry very little information in a text. It is common practice in text analysis to remove them. NLTK has stopword corpora for a number of languages. Load the English stopwords corpus and print some of the words:
import nltk

# Requires the corpus data; run nltk.download('stopwords') once if it is missing.
sw = set(nltk.corpus.stopwords.words('english'))
print("Stop words:", list(sw)[:7])
The following common words are printed (the selection and order may differ on your machine, because a set is unordered):
Stop words: ['between', 'who', 'such', 'ourselves', 'an', 'ain', 'ours']
Note that all the words in this corpus are in lowercase.
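Because the corpus is lowercase, a token should be lowercased before it is checked against the set; otherwise a sentence-initial "The" would slip through. The following is a minimal sketch of that idea, not necessarily the approach used later in this section. The helper name remove_stopwords and the small stand-in set sample_sw are illustrative assumptions, so that the example runs without the NLTK data installed:

```python
def remove_stopwords(tokens, stopwords):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


# Hypothetical stand-in for set(nltk.corpus.stopwords.words('english')).
sample_sw = {"the", "an", "who", "between", "such"}

tokens = ["The", "whale", "swam", "between", "an", "island", "Such", "rocks"]
filtered = remove_stopwords(tokens, sample_sw)
print(filtered)
```

With the real corpus, the sw set loaded earlier can be passed in place of sample_sw. Using a set rather than a list keeps each membership test at constant time on average.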
NLTK also has a Gutenberg corpus. Project Gutenberg is a digital library of books, mostly with expired copyright, that are available for free on the Internet (see http://www.gutenberg.org/).
Load the Gutenberg corpus and print some of its filenames:
gb = nltk.corpus.gutenberg
print("Gutenberg files:\n", gb.fileids()[-5:])
Some of the titles printed may be familiar to you:
Gutenberg files: ['milton-paradise.txt...