Understanding themes in document corpuses
Bag-of-words-based techniques can also be used to classify documents by common theme or to identify the themes within a corpus of documents. Broadly, like most techniques in this area, they attempt to reduce the dimensionality of the term-document matrix, in this case based on each word's relation to a set of latent variables.
One of the earliest approaches to this type of classification was Latent Semantic Analysis (LSA). LSA can mitigate some of the limitations of count-based methods associated with synonyms and with terms that have multiple meanings. Over the years, the ideas behind LSA evolved into a probabilistic model called Latent Dirichlet Allocation (LDA).
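As a rough illustration of the idea behind LSA, the following minimal sketch builds a small term-document count matrix and reduces it with a truncated singular value decomposition. The four toy documents, the choice of k = 2 latent dimensions, and all variable names are assumptions made for this example. Practical pipelines usually also apply a weighting such as TF-IDF, which is omitted here.

```python
import numpy as np

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "a cat and a kitten played",
    "stocks fell as markets slid",
    "markets rallied and stocks rose",
]

# Build the vocabulary and a term-document count matrix
# (rows = terms, columns = documents).
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}
term_doc = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        term_doc[index[word], j] += 1

# Truncated SVD: keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
term_vectors = U[:, :k] * s[:k]    # each term as a point in the k-dim latent space
doc_vectors = Vt[:k, :].T * s[:k]  # each document as a point in the same space

print(term_vectors.shape, doc_vectors.shape)
```

Documents and terms now live in the same low-dimensional space, so they can be compared (for example, with cosine similarity) even when two documents share few exact words.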
LDA allows us to identify the latent thematic structure in a collection of documents. Both LSA and LDA use the term-document matrix to reduce the dimensionality of the term space and to produce topic weights. A constraint of both techniques is that they work best when applied to large documents, because short texts contain too few word co-occurrences to estimate topics reliably.
Note
For more detailed explanation...