Unsupervised learning
Another, more advanced way to analyze text is unsupervised learning. We can take the word vectors or TF-IDF vectors for each document and cluster the documents with the techniques we learned in the last chapter.
However, this tends not to work well: the "elbow" in the within-cluster sum of squares (WCSS) plot often does not appear clearly, so we are left without an obvious number of clusters. A better way to look at how text groups together is topic modeling.
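To make the quantity behind the elbow plot concrete, here is a minimal, dependency-free sketch that builds TF-IDF vectors for a toy corpus and computes the WCSS for several values of k. The corpus, the helper names (`tfidf_vectors`, `kmeans_wcss`), the smoothed IDF variant, and the choice of k-means are all assumptions made for illustration; in practice a library implementation would normally be used.

```python
import math
import random
from collections import Counter

# Toy corpus (hypothetical): two pet sentences, two finance sentences, one mixed.
docs = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
    "the dog watched the markets",
]

def tfidf_vectors(texts):
    """Return one L2-normalized TF-IDF vector (list of floats) per document."""
    tokenized = [t.split() for t in texts]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(texts)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed IDF; this is one common variant among several.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec])
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_wcss(points, k, n_restarts=10, n_iter=50, seed=0):
    """Run Lloyd's algorithm from several random starts; return the lowest WCSS."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(n_restarts):
        # Initialize centroids at k distinct data points.
        centroids = [list(p) for p in rng.sample(points, k)]
        for _ in range(n_iter):
            # Assignment step: index of the nearest centroid for every point.
            labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
            # Update step: move each centroid to the mean of its members.
            new_centroids = []
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:
                    new_centroids.append(
                        [sum(col) / len(members) for col in zip(*members)])
                else:
                    new_centroids.append(centroids[j])  # keep an empty cluster's centroid
            if new_centroids == centroids:
                break
            centroids = new_centroids
        wcss = sum(min(sq_dist(p, c) for c in centroids) for p in points)
        best = min(best, wcss)
    return best

vectors = tfidf_vectors(docs)
wcss_by_k = {k: kmeans_wcss(vectors, k) for k in range(1, len(docs) + 1)}
for k, wcss in wcss_by_k.items():
    print(f"k={k}: WCSS={wcss:.3f}")
```

Plotting `wcss_by_k` against k gives the elbow plot. With sparse, high-dimensional text vectors the curve often declines gradually rather than bending sharply, which is the difficulty described above.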
Topic modeling
There are many algorithms for performing topic modeling:
- Singular value decomposition (SVD), used in latent semantic analysis (LSA) and latent semantic indexing (LSI)
- Probabilistic latent semantic analysis (PLSA)
- Non-negative matrix factorization (NMF)
- Latent Dirichlet allocation (LDA)
- Others, such as neural network models (for example, TopicRNN and Top2Vec)
Each of these methods has strengths and weaknesses...