Filtering out stopwords, names, and numbers
Text analysis commonly requires removing stopwords (common words with low information value). NLTK includes stopword corpora for a number of languages. Load the English stopwords corpus and print some of the words:
import nltk

# Requires the corpus data, for example via nltk.download('stopwords')
sw = set(nltk.corpus.stopwords.words('english'))
print("Stop words", list(sw)[:7])
Common words such as the following are printed (a set is unordered, so the seven you see may differ):
Stop words ['all', 'just', 'being', 'over', 'both', 'through', 'yourselves']
Notice that all the words in this corpus are in lowercase.
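Because the stopwords are all lowercase, a token such as 'All' only matches if it is lowercased before the lookup. The following is a minimal sketch of that filtering step. The names sample_stopwords and remove_stopwords are hypothetical and not part of NLTK, and the small hand-picked set merely stands in for the full corpus so that the example runs without the NLTK data:

```python
# Hypothetical stand-in for set(nltk.corpus.stopwords.words('english'));
# a handful of entries keeps the sketch runnable without the NLTK data.
sample_stopwords = {"all", "just", "being", "over", "both", "through", "the"}


def remove_stopwords(tokens, stopwords):
    """Return the tokens whose lowercase form is not in stopwords."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


tokens = ["All", "the", "soldiers", "marched", "through", "Rome"]
print(remove_stopwords(tokens, sample_stopwords))
# prints ['soldiers', 'marched', 'Rome']
```

Keeping the stopwords in a set makes each membership test a constant-time lookup on average, which matters when filtering the tokens of a whole book.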
NLTK also has a Gutenberg corpus. Project Gutenberg is a digital library of books, mostly with expired copyright, that are available for free on the Internet (see http://www.gutenberg.org/).
Load the Gutenberg corpus and print some of its filenames:
gb = nltk.corpus.gutenberg
print("Gutenberg files", gb.fileids()[-5:])
Some of the titles printed may be familiar to you:
Gutenberg files ['milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare...