Gensim LDA for a larger project
Let's see how the LDA topic modeling process changes when we have a larger set of documents and words to work with. Suppose we extend the LKML data set beyond the 78 e-mails from January 2016 to include every e-mail Linus Torvalds has ever sent to the LKML. After cleaning the data to remove missing messages, source code, attachments, Linus' own name used as a signature, and end-of-line characters, we have a single text file containing 22,546 e-mails. This e-mail text file, called lkmlLinusAll.txt, is provided on the GitHub site for this chapter at https://github.com/megansquire/masteringDM/tree/master/ch8.
After reading these e-mails into a dictionary, our program reports that there are 26,709 unique tokens. Asking for the same four topics and five words per topic, but with only one pass over this larger data set, yields the following topic list:
[ (0,'0.014*people + 0.013*think + 0.011*merge + 0.010*actually + 0.010*like'), (1,'0.011*fix...
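The run described above can be sketched roughly as follows. This is a minimal illustration and not the chapter's actual program. It assumes that lkmlLinusAll.txt holds one cleaned e-mail per line, and the tiny STOP_WORDS set, the regex tokenizer, and the helper names tokenize, load_emails, fit_lda, and report are all hypothetical stand-ins. Because the tokenizing differs and LDA starts from a random state, the token count and topic weights it prints will likely not match the figures quoted here.

```python
import re

# Assumption: a tiny illustrative stop list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is",
              "it", "i", "that", "this", "for", "on"}


def tokenize(line):
    """Lowercase a line, keep alphabetic runs, and drop stop words."""
    words = re.findall(r"[a-z]+", line.lower())
    return [w for w in words if w not in STOP_WORDS]


def load_emails(path):
    """Assumption: the file stores one cleaned e-mail per line."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return [tokenize(line) for line in handle if line.strip()]


def fit_lda(texts, num_topics=4, passes=1):
    """Build a Gensim dictionary and bag-of-words corpus, then fit LDA."""
    # Imported here so the helpers above can be used without Gensim installed.
    from gensim import corpora, models

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    lda = models.LdaModel(corpus, num_topics=num_topics,
                          id2word=dictionary, passes=passes)
    return dictionary, lda


def report(path, num_topics=4, num_words=5):
    """Print the unique-token count and the top words for each topic."""
    texts = load_emails(path)
    dictionary, lda = fit_lda(texts, num_topics=num_topics, passes=1)
    print(len(dictionary), "unique tokens")
    for topic in lda.print_topics(num_topics=num_topics, num_words=num_words):
        print(topic)
```

Calling report("lkmlLinusAll.txt") from the directory that contains the file would print the dictionary size followed by four topics of five words each. Keeping passes at 1 mirrors the single pass used for this larger data set.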