In this chapter, we introduced topic modeling. We discussed latent semantic analysis (LSA), which is based on truncated SVD; probabilistic latent semantic analysis (PLSA), which aims to build a model without making assumptions about the prior probabilities of the latent factors; and latent Dirichlet allocation (LDA), which outperformed the previous method and is based on the assumption that the latent factors have a sparse Dirichlet prior distribution. This sparsity means that a document normally covers only a limited number of topics, and that a topic is characterized by only a few important words.
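The effect of a sparse Dirichlet prior can be illustrated with a minimal sketch that uses only the Python standard library. It is not part of the chapter's examples: the helper names `sample_dirichlet` and `mean_largest_weight`, the choice of 10 topics, and the concentration values 0.1, 1.0, and 10.0 are all assumptions made for illustration. The sketch draws topic mixtures from a symmetric Dirichlet distribution and reports how much weight the dominant topic receives on average. A small concentration value should push most of the mass onto a few topics, while a large one should spread it almost evenly.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha, ..., alpha) of dimension `size`."""
    # A Dirichlet sample can be obtained by normalizing independent
    # Gamma(alpha, 1) draws so that they sum to 1.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_weight(alpha, size=10, n_samples=500, seed=1000):
    """Average weight of the dominant component over `n_samples` draws.

    `size`, `n_samples`, and `seed` are illustrative defaults (assumptions).
    """
    rng = random.Random(seed)
    largest = (max(sample_dirichlet(alpha, size, rng)) for _ in range(n_samples))
    return sum(largest) / n_samples


if __name__ == "__main__":
    # Smaller alpha values correspond to a sparser prior.
    for alpha in (0.1, 1.0, 10.0):
        weight = mean_largest_weight(alpha)
        print(f"alpha={alpha}: mean largest topic weight = {weight:.3f}")
```

With 10 topics, a perfectly uniform mixture gives every topic a weight of 0.1, so the further the reported value lies above 0.1, the sparser the sampled mixtures are.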
In the last section, we discussed sentiment analysis of documents, which aims to determine whether a piece of text expresses a positive or negative feeling. To show a feasible solution, we built a classifier based on an NLP pipeline and a random forest, with average performance, that can be used in many real-life situations...