Classifying text documents using Mallet
The final two recipes in this chapter tackle a classical machine-learning problem: the classification of documents using language modelling. In this recipe, we will use Mallet and its command-line interface to train a model and apply it to unseen test data.
Classification in Mallet involves three steps:
Convert your training documents into Mallet's native format.
Train your model on the training documents.
Apply the model to classify unseen test documents.
Converting your training documents into Mallet's native format means, in technical terms, converting them into feature vectors. You do not need to extract any features from your training or test documents yourself, as Mallet takes care of this. You can either physically separate the training and testing data, or keep one flat list of documents and split it into training and testing portions using command-line options.
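As a rough illustration, the three steps and the alternative way of splitting the data might look like the following sketch. The directory names (train/spam, train/ham, test), the output file names, and the location of the Mallet launcher are hypothetical and chosen only for this example. The small run helper is also an addition of this sketch: it prints each command and executes it only when RUN=1 is set, so the script can be tried without Mallet installed.

```shell
#!/bin/sh
# Hypothetical layout: train/spam and train/ham hold labeled training
# documents (one directory per class); test/ holds unseen documents.
MALLET="${MALLET:-bin/mallet}"   # assumed path to the Mallet launcher

# Print each command; execute it only when RUN=1 is set.
run() {
    echo "$@"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# Step 1: convert the training documents into Mallet's native format.
run "$MALLET" import-dir --input train/spam train/ham --output train.mallet

# Step 2: train a classifier on the imported data and save it to a file.
run "$MALLET" train-classifier --input train.mallet --output-classifier spam.classifier

# Step 3: classify the unseen documents, writing results to standard output.
run "$MALLET" classify-dir --input test --output - --classifier spam.classifier

# Alternative to physically separate data: train on 90% of one imported
# collection and let Mallet evaluate on the remaining 10%.
run "$MALLET" train-classifier --input train.mallet --training-portion 0.9
```

Running the script as is only prints the four commands. Running it with RUN=1 executes them, provided that Mallet is installed and the directories exist.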
Let us consider...