Grouping similar text documents with k-means clustering methods
Computer programs cannot readily interpret the meaning of a sentence, so they cannot group documents by similarity directly. However, if we convert the sentences into a numeric matrix (a document-term matrix), a program can compute the distance between each pair of documents and group similar ones together.
In this recipe, we demonstrate how to compute the distance between text documents and how we can cluster similar text documents with the k-means method.
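Before turning to the real data, the idea can be sketched in base R alone. This is only a minimal illustration under stated assumptions: the four headlines are made up, the tokenizer is a simple split on non-letters, and the names `docs`, `tokens`, `vocab`, `dtm`, `doc_dist`, and `fit` are hypothetical. It is not the recipe's own code, which relies on the tm package.

```r
# Toy headlines, made up for illustration (not the recipe's news data).
docs <- c(
  "stock market falls sharply",
  "stock market rises sharply",
  "team wins football match",
  "team loses football match"
)

# Split each lowercased headline into words on runs of non-letters.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Vocabulary: every distinct word across all documents, sorted.
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term,
# each cell counting how often the term occurs in that document.
dtm <- t(vapply(
  tokens,
  function(words) as.numeric(tabulate(match(words, vocab), nbins = length(vocab))),
  numeric(length(vocab))
))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

# Pairwise Euclidean distances between the document rows.
doc_dist <- dist(dtm)

# k-means with two clusters; nstart tries several random starts
# and keeps the best one, and set.seed makes the run repeatable.
set.seed(42)
fit <- kmeans(dtm, centers = 2, nstart = 10)

print(round(as.matrix(doc_dist), 2))
print(fit$cluster)
```

With raw term counts, headlines that share most of their words sit close together, so the two finance headlines and the two sports headlines should each end up in their own cluster. Real text usually needs more preparation (stop-word removal, stemming, weighting) than this sketch shows.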
Getting ready
In this recipe, we use news titles as clustering input. You can find the data on the author's GitHub page at https://github.com/ywchiu/rcookbook/raw/master/chapter12/news.RData.
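One possible way to fetch and inspect the file is sketched below. The URL is the one given above; the helper `load_rdata` is a hypothetical name introduced here, and because the names of the objects stored inside news.RData are not stated in this section, the sketch lists them rather than assuming any.

```r
# URL of the data file, as given in the text.
news_url <- "https://github.com/ywchiu/rcookbook/raw/master/chapter12/news.RData"

# Hypothetical helper: load an .RData file into a fresh environment
# (so nothing in the workspace is overwritten) and return its objects
# as a named list.
load_rdata <- function(path) {
  env <- new.env()
  loaded_names <- load(path, envir = env)
  mget(loaded_names, envir = env)
}

# Usage (needs network access), left commented out:
# local_file <- tempfile(fileext = ".RData")
# download.file(news_url, local_file, mode = "wb")  # "wb" keeps the binary file intact
# news_objects <- load_rdata(local_file)
# names(news_objects)                               # see what the file contains
```

Loading into a separate environment is a design choice of this sketch; calling `load()` directly on the downloaded file also works and places the objects in the current workspace.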
How to do it…
Perform the following steps to cluster text documents with k-means clustering:
- First, install and load the tm and SnowballC packages:
> install.packages('tm')
> library(tm)
> install.packages('SnowballC...