To illustrate the concepts in this chapter, we will use the Bag of Words (https://archive.ics.uci.edu/ml/datasets/bag+of+words) dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). The dataset describes Enron emails as a set of email IDs, word IDs, and counts, where each count is the number of times a particular word appeared in a given email.
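To make the email ID, word ID, and count structure concrete, here is a minimal sketch of how such a file could be read into per-email word counts. It assumes the layout described on the UCI dataset page: three header lines (number of documents, vocabulary size, number of non-zero entries) followed by one `doc_id word_id count` triple per line. The function name `read_docword` and the sample data are hypothetical and used only for illustration.

```python
import io
from collections import defaultdict


def read_docword(stream):
    """Parse a bag-of-words stream into {doc_id: {word_id: count}}.

    Assumed layout: three header lines (D, W, NNZ), then one
    'doc_id word_id count' triple per line.
    """
    n_docs = int(stream.readline())
    n_words = int(stream.readline())
    n_nonzero = int(stream.readline())

    counts = defaultdict(dict)
    for line in stream:
        if not line.strip():
            continue  # skip blank lines
        doc_id, word_id, count = map(int, line.split())
        counts[doc_id][word_id] = count
    return n_docs, n_words, n_nonzero, dict(counts)


# Made-up sample: 2 emails, 3 vocabulary words, 4 non-zero entries.
sample = "2\n3\n4\n1 1 2\n1 3 1\n2 2 5\n2 3 1\n"
n_docs, n_words, n_nonzero, counts = read_docword(io.StringIO(sample))
print(counts[1])  # word counts for email 1
```

For the compressed file itself, the same function should work on a text-mode handle such as `gzip.open(path, "rt")`, where `path` points to wherever you saved the download.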
In the GitHub repository (https://github.com/PacktPublishing/Hands-On-Artificial-Intelligence-on-Amazon-Web-Services/tree/master/Ch9_NTM) associated with this chapter, you should find the following files:
- docword.enron.txt.gz (https://github.com/PacktPublishing/Hands-On-Artificial-Intelligence-on-Amazon-Web-Services/blob/master/Ch9_NTM/data/docword.enron.txt.gz): Contains the email IDs, word IDs, and the corresponding word counts
- vocab.enron.txt (https://github.com/PacktPublishing/Hands-On-Artificial-Intelligence-on-Amazon...