The segmentation of documents
To identify groups of cleaned terms based on their frequency and association across the documents of the corpus, one might use our tdm matrix directly to run, for example, the classic hierarchical clustering algorithm.
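As a minimal sketch of this idea, the following clusters terms from a small term-document matrix. It assumes the tm package is installed. The toy sentences and the names toy_corpus, toy_tdm, term_hc, and term_groups are made up for illustration and stand in for the real corpus and tdm; the choice of three groups is arbitrary.

```r
library(tm)

# Hypothetical toy corpus standing in for the cleaned package descriptions
docs <- c(
  "tools for data manipulation and data frames",
  "grammar of graphics for data visualization",
  "tools for web scraping and parsing html",
  "parsing html and xml documents from the web",
  "graphics devices and visualization of models"
)
toy_corpus <- VCorpus(VectorSource(docs))

# Terms end up in the rows, documents in the columns
toy_tdm <- TermDocumentMatrix(toy_corpus)

# dist() compares rows, so these are distances between terms
term_dist <- dist(as.matrix(toy_tdm))
term_hc <- hclust(term_dist)

# Cut the tree into three groups of terms (an arbitrary choice for this sketch)
term_groups <- cutree(term_hc, k = 3)
split(names(term_groups), term_groups)
```

Because dist computes distances between the rows of its input, passing a term-document matrix groups terms rather than documents.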
On the other hand, if you would rather cluster the R packages based on their descriptions, you should compute a new matrix with DocumentTermMatrix instead of the previously used TermDocumentMatrix. Then, calling the clustering algorithm on this matrix would result in the segmentation of the packages.
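The document-oriented variant can be sketched in the same way, again assuming the tm package is installed. The package names pkg_a to pkg_d and their descriptions are hypothetical placeholders, not real packages, and the number of clusters is chosen only for the sake of the example.

```r
library(tm)

# Hypothetical package descriptions; the names are placeholders
docs <- c(
  pkg_a = "tools for data manipulation and data frames",
  pkg_b = "fast data manipulation for large data frames",
  pkg_c = "parsing html and xml documents from the web",
  pkg_d = "web scraping tools for parsing html pages"
)
toy_corpus <- VCorpus(VectorSource(unname(docs)))

# Documents end up in the rows, terms in the columns
toy_dtm <- DocumentTermMatrix(toy_corpus)
m <- as.matrix(toy_dtm)
rownames(m) <- names(docs)  # label the rows so the groups are readable

# dist() compares rows, so these are distances between documents
doc_hc <- hclust(dist(m))

# Cut the tree into two groups of documents
doc_groups <- cutree(doc_hc, k = 2)
split(names(doc_groups), doc_groups)
```

The only substantive change from clustering terms is the orientation of the matrix: with documents in the rows, the same dist and hclust calls segment the packages instead of the terms.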
For more details on the available methods and algorithms, and for guidance on choosing the appropriate functions for clustering, please see Chapter 10, Classification and Clustering. For now, we will fall back to the traditional hclust function, which provides a built-in way of running hierarchical clustering on distance matrices. For a quick demo, let's try this on the so-called Hadleyverse, which describes a useful collection...