Text clustering
Clustering is an unsupervised learning technique. Intuitively, clustering groups objects into disjoint sets. We do not know in advance how many groups exist in the data, or what the members of each group (cluster) have in common.
Text clustering has several applications. For example, an organization may want to group its internal documents into clusters of similar documents. The notion of similarity or distance is central to the clustering process. A common approach is to represent each document as a vector of term weights, such as raw word frequencies or TF-IDF scores, and to compare those vectors with cosine similarity. Cosine similarity is the cosine of the angle between the two document vectors, that is, their dot product divided by the product of their lengths; the corresponding cosine distance is one minus the cosine similarity. Spark provides a variety of clustering algorithms that can be effectively used in text analytics.
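To make the cosine measure concrete, here is a minimal sketch in plain Python rather than Spark. It assumes a naive whitespace tokenizer and raw word counts as the document vectors; the helper names `term_frequencies` and `cosine_similarity` and the three sample sentences are illustrative and not part of any library.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count lowercase, whitespace-separated tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Dot product of two sparse count vectors divided by the product of their lengths."""
    # A Counter returns 0 for terms it has not seen, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an empty document is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents.
doc1 = term_frequencies("spark makes text clustering simple")
doc2 = term_frequencies("text clustering with spark")
doc3 = term_frequencies("quarterly budget review meeting")

sim_related = cosine_similarity(doc1, doc2)
sim_unrelated = cosine_similarity(doc1, doc3)

print("related:   similarity", sim_related, "distance", 1 - sim_related)
print("unrelated: similarity", sim_unrelated, "distance", 1 - sim_unrelated)
```

The first pair shares three terms and scores about 0.67, while the second pair shares none and scores 0. Swapping the raw counts for TF-IDF weights would leave `cosine_similarity` unchanged, since it only needs a mapping from terms to weights.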
K-means
K-means is perhaps the most intuitive of the clustering algorithms. The idea is to partition the data points into K clusters based on some similarity measure, such as cosine distance or Euclidean distance. This algorithm that starts with...