Summary
This chapter was devoted to topic models; after sentiment analysis on movie reviews, this was our second foray into working with real-life text data. This time, our predictive task was to classify the topics of news articles on the web. The primary technique we focused on was latent Dirichlet allocation (LDA), which takes its name from the assumption that the topic and word distributions found inside a document arise from hidden multinomial distributions sampled from Dirichlet priors. We saw that the generative process of sampling words and topics from these multinomial distributions mirrors many of our natural intuitions about this domain; however, it conspicuously fails to account for correlations between the various topics that can co-occur inside a document.
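The generative process described above can be made concrete with a short sketch. This is purely illustrative: the corpus dimensions and the hyperparameter values `alpha` and `beta` are assumptions chosen for the example, not values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and hyperparameters (assumed, not from the text)
n_topics = 3
vocab_size = 8
doc_length = 20
alpha = 0.5   # Dirichlet prior on per-document topic proportions
beta = 0.1    # Dirichlet prior on per-topic word distributions

# Each topic k has a hidden word distribution phi_k ~ Dirichlet(beta)
phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

# To generate one document:
# 1. draw its topic proportions theta ~ Dirichlet(alpha)
theta = rng.dirichlet(np.full(n_topics, alpha))
# 2. for each word position, draw a topic z from theta,
#    then draw the word itself from that topic's distribution phi_z
topics = rng.choice(n_topics, size=doc_length, p=theta)
words = np.array([rng.choice(vocab_size, p=phi[z]) for z in topics])

print(words)
```

Note that each document gets its own `theta`, while the `phi` distributions are shared across the corpus; this is exactly why documents can mix topics yet topics remain globally coherent.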
In our experiments, we saw that there is more than one way to fit an LDA model; in particular, the method known as Gibbs sampling tends to be more accurate...
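The Gibbs sampling approach mentioned above can be sketched as a minimal collapsed Gibbs sampler for LDA. This is a didactic sketch, not the implementation used in the chapter; the function name, hyperparameter defaults, and toy corpus are all assumptions made for illustration.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=100, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Randomly initialize a topic assignment for every word token
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    ndk = np.zeros((len(docs), n_topics))   # document-topic counts
    nkw = np.zeros((n_topics, vocab_size))  # topic-word counts
    nk = np.zeros(n_topics)                 # total words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this token's current assignment from the counts
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Conditional p(z=k) ~ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                p /= p.sum()
                # Resample the token's topic and restore the counts
                k = rng.choice(n_topics, p=p)
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw

# Toy corpus of three tiny documents over a 6-word vocabulary (assumed data)
docs = [[0, 1, 2, 0, 1, 2], [3, 4, 5, 3, 4, 5], [0, 1, 0, 2, 1, 0]]
doc_topics, topic_words = gibbs_lda(docs, n_topics=2, vocab_size=6)
print(doc_topics)
```

The key trick is that `theta` and `phi` are integrated out analytically, so the sampler only tracks count matrices; estimates of the topic distributions can be recovered afterwards by normalizing the smoothed counts.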