Understanding text clustering
Until now, our primary goal was to assign a predefined label to a piece of text so that we could categorize it as spam or ham, label its topic, identify its sentiment, and so forth. In all of those cases, the labels were predetermined, which is the distinctive feature of supervised learning. In many other situations, however, the labels are not known in advance. Consider, for example, collecting feedback about a service or product through surveys. Open-ended questions are an essential part of most questionnaires, but identifying recurring themes in the answers is tedious when done manually. Other examples include news topics, customer call transcripts, user tweets, and many more. In all of these cases, businesses benefit from discovering insights in the chaos of unstructured data and seizing potential opportunities.
Algorithms that learn the structure of the data without any assistance (no labels or classes given) are part of unsupervised...
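As a rough illustration of learning structure without labels, the following sketch groups a handful of made-up survey responses using only the words they contain. It is a simplified, k-means-style procedure over bag-of-words vectors with cosine similarity, chosen here purely for illustration and not a method prescribed by the text. The function names (vectorize, cosine, mean_vector, cluster), the sample responses, and the choice to seed the centroids with the first k texts are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words vector (word -> count)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Average a non-empty list of sparse vectors into a centroid."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {word: count / len(vectors) for word, count in total.items()}


def cluster(texts, k, max_iter=10):
    """Simplified k-means: returns one cluster index per text."""
    vectors = [vectorize(t) for t in texts]
    # Assumption: seed the centroids with the first k texts (deterministic).
    centroids = [dict(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every text to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda c: cosine(vec, centroids[c]))
            for vec in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = mean_vector(members)
    return labels


# Hypothetical open-ended survey answers; no labels are supplied.
responses = [
    "delivery was late and the package arrived damaged",
    "the app crashes when I open the settings page",
    "package arrived late, delivery took two weeks",
    "app crashes on the settings page every time",
]

labels = cluster(responses, k=2)
print(labels)  # [0, 1, 0, 1]
```

The two delivery complaints end up in one group and the two app complaints in the other, even though the code was never told which themes exist. The cluster indices themselves carry no meaning; interpreting each group is still left to a human reader. On real data, a tested library implementation (for example, the k-means estimator in scikit-learn) would normally be preferable to a hand-rolled loop like this one.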