Document clustering
Document clustering is the process of grouping or partitioning text documents into meaningful groups. Clustering algorithms rest on the hypothesis that a good partition minimizes the distance between objects within a cluster (the intra-cluster distance) while keeping the distance between clusters (the inter-cluster distance) at a maximum.
For example, if we cluster a collection of news articles, we will find that similar documents lie closer to each other and fall into the same cluster.
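To make the intra-cluster versus inter-cluster idea concrete, here is a minimal sketch in base R. The four documents, their term counts, and the grouping are all made up for illustration; Euclidean distance is assumed only because it is the default of dist().

```r
# Hypothetical term counts for four news articles
# (rows = documents, columns = terms); the values are invented.
docs <- matrix(
  c(5, 3, 0, 0,
    4, 4, 0, 1,
    0, 0, 6, 4,
    1, 0, 5, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("oil1", "oil2", "sport1", "sport2"),
                  c("oil", "price", "match", "goal"))
)

# Assumed grouping: the first two articles in cluster 1, the last two in cluster 2
groups <- c(1, 1, 2, 2)

# Pairwise Euclidean distances between documents
d <- as.matrix(dist(docs))

# TRUE where both documents of a pair belong to the same cluster
same <- outer(groups, groups, "==")

# Use the upper triangle so that each pair is counted once
pairs <- upper.tri(d)

intra <- mean(d[pairs & same])   # average distance within clusters, about 1.73
inter <- mean(d[pairs & !same])  # average distance between clusters, about 8.71

print(c(intra = intra, inter = inter))
```

A grouping is considered good when the first number is small relative to the second, which is the case for this toy data.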
Some of the commonly used text clustering methods are as follows:
Standard methods:
K-means
Hierarchical clustering
Specialized clustering:
Suffix tree clustering
Frequent-term set-based
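The two standard methods listed above are available in base R as kmeans() and hclust(). The following is a small sketch on invented term counts, not a recommended configuration: the data, the choice of two clusters, and the average linkage are assumptions made purely for illustration.

```r
# Hypothetical term counts for six documents; the values are invented.
docs <- matrix(
  c(5, 3, 0, 0,
    4, 4, 0, 1,
    6, 2, 1, 0,
    0, 0, 6, 4,
    1, 0, 5, 5,
    0, 1, 4, 6),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("crude1", "crude2", "crude3",
                    "sport1", "sport2", "sport3"),
                  c("oil", "price", "match", "goal"))
)

# K-means: k must be chosen up front (k = 2 is an assumption here).
# nstart reruns the algorithm from several random starts and keeps the best fit.
set.seed(1)
km <- kmeans(docs, centers = 2, nstart = 10)
print(km$cluster)

# Hierarchical clustering: build the full tree from pairwise distances,
# then cut it into the desired number of groups.
hc <- hclust(dist(docs), method = "average")
hc_clusters <- cutree(hc, k = 2)
print(hc_clusters)
```

K-means needs the number of clusters in advance, whereas the hierarchical tree can be cut at any level after it has been built, for example after inspecting plot(hc).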
Let's take a simple example of a document-term matrix created from data available with the tm package in R:
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude,
                          control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                                         stopwords = TRUE))
dtm
<<DocumentTermMatrix (documents: 20, terms: 1200)>>
Non-/sparse entries: 1890...
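As a possible next step, the weighted matrix can be handed to the standard clustering functions in base R. This is only a sketch: the choice of three clusters, the seed, and the Ward linkage are assumptions for illustration and are not taken from the text above.

```r
library(tm)
data("crude")

# Same TF-IDF weighted document-term matrix as above
dtm <- DocumentTermMatrix(
  crude,
  control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                 stopwords = TRUE)
)

# Convert the sparse matrix to a dense one (fine for a corpus this small)
m <- as.matrix(dtm)

# K-means with an assumed k = 3
set.seed(42)
km <- kmeans(m, centers = 3, nstart = 20)
print(table(km$cluster))   # number of documents per cluster

# Hierarchical clustering on Euclidean distances, cut into 3 groups
hc <- hclust(dist(m), method = "ward.D2")
hc_clusters <- cutree(hc, k = 3)
print(hc_clusters)
```

Each row of m is one document, so both results assign one cluster label per document in the corpus.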