In this chapter, we introduced topic modeling. We discussed latent semantic analysis based on truncated SVD, PLSA (which aims to build a model without any assumption about the prior probabilities of the latent factors), and LDA, which generally outperforms the previous methods and is based on the assumption that the latent factors have a sparse Dirichlet prior. This means that a document normally covers only a limited number of topics, and a topic is characterized by only a few important words.
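As a quick reminder of how LDA is used in practice, the following is a minimal sketch based on scikit-learn's LatentDirichletAllocation; the toy corpus, the number of topics, and the prior values are illustrative assumptions, not the example from the chapter:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative toy corpus (assumed, not the chapter's dataset)
corpus = [
    "the game ended with a late goal by the home team",
    "the new model improves accuracy on the benchmark dataset",
    "investors reacted to the quarterly earnings report",
]

# LDA works on raw term counts, so a bag-of-words representation is used
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

# Dirichlet concentration parameters below 1.0 yield sparse priors:
# each document favors few topics and each topic favors few words
lda = LatentDirichletAllocation(n_components=2,
                                doc_topic_prior=0.1,
                                topic_word_prior=0.1,
                                random_state=1000)
doc_topic = lda.fit_transform(X)

# Show the most relevant words for each topic
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[-5:][::-1]
    print('Topic %d: %s' % (i, ', '.join(terms[j] for j in top)))
```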
In the last section, we discussed the basics of Word2vec and the sentiment analysis of documents, which aims to determine whether a piece of text expresses a positive or a negative feeling. To show a feasible solution, we built a classifier based on an NLP pipeline and a random forest, with average performance that can nonetheless be used in many real-life situations.
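Along the same lines, the sketch below shows one possible shape of such a classifier, combining a TF-IDF vectorization step with a random forest in a single scikit-learn pipeline; the tiny labeled corpus and the hyperparameters are assumptions for illustration only, not the chapter's actual dataset or configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Illustrative labeled examples (1 = positive, 0 = negative)
docs = [
    "I really loved this movie, great acting",
    "A wonderful and touching story",
    "Terrible plot and awful dialogue",
    "I hated every minute of it",
]
labels = [1, 1, 0, 0]

# NLP pipeline: TF-IDF features followed by a random forest classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=1000)),
])
pipeline.fit(docs, labels)

# Predict the sentiment of unseen sentences
print(pipeline.predict(["an amazing, beautiful film",
                        "boring and disappointing"]))
```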
In the next chapter, Chapter 15,...