Clustering newsgroups data using k-means
The newsgroups data comes with labels, which are the categories of the newsgroups. A number of these categories are closely related or even overlapping, for instance the five computer groups (comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, and comp.windows.x) and the two religion-related ones (alt.atheism and talk.religion.misc).
Let's now pretend we don't know those labels, or that they don't exist. Will samples from related topics be clustered together? To find out, we turn to the k-means clustering algorithm.
How does k-means clustering work?
The goal of the k-means algorithm is to partition the data into k groups based on feature similarities. K is a predefined property of a k-means clustering model. Each of the k clusters is specified by a centroid (center of a cluster) and each data sample belongs to the cluster with the nearest centroid. During training, the algorithm...
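The assignment rule described above (each sample belongs to the cluster with the nearest centroid) can be sketched in a few lines. This is only an illustration of that single rule, not of the full training procedure: the centroids and samples are made-up toy values, the helper name nearest_centroid is hypothetical, and Euclidean distance is assumed as the measure of similarity.

```python
import math


def nearest_centroid(sample, centroids):
    """Return the index of the centroid closest to `sample`.

    Assumes Euclidean distance; `sample` and each centroid are
    sequences of numbers with the same length.
    """
    distances = [math.dist(sample, centroid) for centroid in centroids]
    return distances.index(min(distances))


# Hypothetical toy data: k = 2 centroids in a 2-D feature space.
centroids = [(0.0, 0.0), (5.0, 5.0)]
samples = [(0.5, 1.0), (4.0, 6.0), (1.0, -0.5), (6.0, 4.5)]

# Each sample is labeled with the index of its nearest centroid.
assignments = [nearest_centroid(s, centroids) for s in samples]
print(assignments)  # [0, 1, 0, 1]
```

Here the first and third samples fall into cluster 0 and the other two into cluster 1, because those are the centroids they lie closest to. Real text data would have far more than two features, but the rule is the same.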