Topics in text mining
As we saw in the first section, the area of text mining and performing Machine Learning on text spans a wide range of topics. Each topic discussed has some customizations to the mainstream algorithms, or there are specific algorithms that have been developed to perform the task called for in that area. We have chosen four broad topics, namely, text categorization, topic modeling, text clustering, and named entity recognition, and will discuss each in some detail.
Text categorization/classification
The text classification problem manifests itself in different applications, such as document filtering and organization, information retrieval, opinion and sentiment mining, e-mail spam filtering, and so on. Similar to the classification problem discussed in Chapter 2, Practical Approach to Real-World Supervised Learning, the general idea is to train on the training data with labels and to predict the labels of unseen documents.
As discussed in the previous section, the preprocessing...