Test your knowledge
Now that we've learned ways of dealing with documents and Excel files, let's put that into practice. For the first task, examine a different set of PDFs of arXiv.org papers, and perform an n-gram analysis to understand what some of the common words and phrases are in the papers. Add trigrams (3-word groups) to your analysis. The ngrams function from nltk.util might be helpful (you might also find Stack Overflow helpful, or the official documentation here: http://www.nltk.org/api/nltk.html?highlight=ngram#nltk.util.ngrams). In the GitHub repository for this book, there is a collection of PDF files from arXiv with "machine learning" in the title. Alternatively, you can go to arXiv.org and download some recent papers with "machine learning" or "data science" in the title.
Be sure to write an analysis of your results, explaining what the n-gram frequency distributions are telling us.
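As a starting point, here is a minimal sketch of counting unigrams, bigrams, and trigrams with the ngrams function. It is not a full solution: it assumes the text has already been extracted from the PDFs into a string, and it uses a short made-up sample_text in place of real paper text. The top_ngrams helper and the simple regular-expression tokenizer are illustrative choices, not part of NLTK, and the small fallback is only there so the sketch still runs if NLTK is not installed.

```python
import re
from collections import Counter

try:
    from nltk.util import ngrams
except ImportError:
    # Fallback for environments without NLTK: yields the same tuples
    # as nltk.util.ngrams does for a plain list with no padding.
    def ngrams(sequence, n):
        tokens = list(sequence)
        return zip(*(tokens[i:] for i in range(n)))


def top_ngrams(text, n, k=5):
    """Return the k most common n-grams in text as (tuple, count) pairs.

    Hypothetical helper for illustration: lowercases the text and keeps
    only runs of the letters a-z as tokens (no stop word removal).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(ngrams(tokens, n)).most_common(k)


# Made-up sample text standing in for text extracted from the PDFs.
sample_text = (
    "Machine learning models learn from data. "
    "Machine learning models need data. "
    "Deep learning models learn from data too."
)

# Unigrams, bigrams, and trigrams.
for n in (1, 2, 3):
    print(n, top_ngrams(sample_text, n, k=3))
```

On real papers, you would likely want to remove stop words and punctuation before counting, since otherwise the most frequent n-grams tend to be dominated by phrases such as "of the".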
For the second task, combine...