We can remove rarely occurring words by finding words with low frequency counts, words whose counts fall outside a certain deviation from the norm, or words from a list considered rare within the given domain. The technique we will use works the same in each case.
Identifying and removing rare words
How to do it
Rare words can be removed by building a list of those words and then removing them from the set of tokens being processed. The list can be determined by using the frequency distribution provided by NLTK. You then decide what frequency threshold marks a word as rare:
- The script in the 07/07_rare_words.py file extends that of the frequency distribution recipe...
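The general idea can be sketched as follows. This is a minimal illustration and not the contents of the script referenced above: the function name `remove_rare_words`, the `min_count` parameter, and the sample tokens are all hypothetical. It uses `collections.Counter` from the standard library so that it runs without extra dependencies; since NLTK's `FreqDist` is a subclass of `Counter`, `nltk.FreqDist(tokens)` should be usable in its place.

```python
from collections import Counter


def remove_rare_words(tokens, min_count=2):
    """Return (kept_tokens, rare_words) for a list of tokens.

    A word is treated as rare when it occurs fewer than min_count times.
    Both the name and the default threshold are illustrative assumptions.
    """
    # Count how often each token occurs (nltk.FreqDist could be used here).
    freq = Counter(tokens)

    # Build the set of rare words from the chosen threshold.
    rare = {word for word, count in freq.items() if count < min_count}

    # Keep only the tokens that are not in the rare set, preserving order.
    kept = [token for token in tokens if token not in rare]
    return kept, rare


# Hypothetical sample data for demonstration only.
tokens = "python developer python engineer remote python developer".split()
kept, rare = remove_rare_words(tokens, min_count=2)
print(kept)
print(sorted(rare))
```

With a threshold of 2, only words that appear once are dropped. If you want exactly the words that occur once, `FreqDist` also offers a `hapaxes()` method that returns them directly.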