Analyzing word frequencies
The NLTK FreqDist class encapsulates a dictionary of words and counts for a given list of words. Load the Gutenberg text of Julius Caesar by William Shakespeare. Let's filter out the stopwords and punctuation:
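import string
# words (the corpus tokens) and sw (the stopword set) are assumed to be loaded beforehand
punctuation = set(string.punctuation)
filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation]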
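The filtering step can be tried on a small scale without downloading any corpus. The following is a minimal sketch, not the chapter's own code: `sample_words`, `sample_sw`, and the helper `filter_tokens` are hypothetical names introduced here, and the tiny stopword set only stands in for the real NLTK stopword list.

```python
import string

# Hypothetical stand-ins: the real tokens and stopwords come from NLTK.
sample_sw = {"the", "and", "of", "is", "a"}
sample_words = ["The", "noble", "Brutus", ",", "and", "Caesar", "!", "the", "end"]

punctuation = set(string.punctuation)


def filter_tokens(words, stopwords, punct):
    """Lowercase each token and drop stopwords and single punctuation marks."""
    result = []
    for w in words:
        lw = w.lower()  # lowercase once instead of three times per token
        if lw not in stopwords and lw not in punct:
            result.append(lw)
    return result


filtered_sample = filter_tokens(sample_words, sample_sw, punctuation)
print(filtered_sample)
```

Because `set(string.punctuation)` contains only single characters, a multi-character token such as `--` passes through this filter unchanged.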
Create a FreqDist object and print the associated keys and values with the highest frequency:
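fd = nltk.FreqDist(filtered)
# most_common(5) gives the five (word, count) pairs with the highest counts
top = fd.most_common(5)
print("Words", [word for word, _ in top])
print("Counts", [count for _, count in top])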
The keys and values are printed as follows:
Words ['d', 'caesar', 'brutus', 'bru', 'haue']
Counts [215, 190, 161, 153, 148]
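As a rough illustration of how such a ranking is produced, the sketch below uses `collections.Counter` from the standard library, which `FreqDist` subclasses in NLTK 3. The function `top_words`, its `min_len` parameter, and the sample tokens are assumptions made for this example. `min_len` shows one possible way to drop one-character fragments such as the `'d'` in the output above.

```python
from collections import Counter


def top_words(tokens, n=5, min_len=1):
    """Return the n most frequent tokens that have at least min_len characters."""
    counts = Counter(t for t in tokens if len(t) >= min_len)
    return counts.most_common(n)


# Invented sample tokens, not taken from the play.
tokens = ["d", "caesar", "d", "brutus", "caesar", "d", "rome"]

print(top_words(tokens, n=2))             # one-character tokens are counted
print(top_words(tokens, n=2, min_len=2))  # one-character tokens are skipped
```

Filtering before counting keeps the short fragments out of the table entirely, so they cannot take up slots in the top-n result.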
The first word in this list is, of course, not an English word, so we may need to add the heuristic that words have a minimum of two characters. The NLTK FreqDist class allows dictionary-like access, but it also has convenience methods. Get the...