Topic modeling with MALLET
MALLET is a well-known library for topic modeling. It also supports document classification and sequence tagging. More about MALLET can be found at http://mallet.cs.umass.edu/index.php. To download MALLET, visit http://mallet.cs.umass.edu/download.php (the latest version at the time of writing is 2.0.6). Once downloaded, extract the archive into a directory of your choice. It contains sample data in .txt format under the sample-data/web/en path of the MALLET directory.
The first step is to import the files into MALLET's internal format. To do this, open the Command Prompt or Terminal, move to the mallet directory, and execute the following command:
mallet-2.0.6$ bin/mallet import-dir --input sample-data/web/en --output tutorial.mallet --keep-sequence --remove-stopwords
This command generates the tutorial.mallet file.
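The import step can also be wrapped in a small shell function that fails early when the input path is wrong and confirms that the output file was written. This is only a sketch: the names MALLET_HOME, import_corpus, and DRY_RUN are hypothetical and not part of MALLET, and the only MALLET options used are the ones from the command above.

```shell
#!/bin/sh
# Sketch of a wrapper around the import step shown above.
# MALLET_HOME, import_corpus, and DRY_RUN are illustrative names (assumptions),
# not something MALLET itself defines.

# Directory where MALLET was extracted; defaults to the current directory.
MALLET_HOME="${MALLET_HOME:-.}"

import_corpus() {
    input_dir="$1"      # directory of .txt files, e.g. sample-data/web/en
    output_file="$2"    # file to create, e.g. tutorial.mallet

    # With DRY_RUN=1, only print the command; nothing is executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$MALLET_HOME/bin/mallet import-dir --input $input_dir --output $output_file --keep-sequence --remove-stopwords"
        return 0
    fi

    # Fail early if the input directory does not exist.
    if [ ! -d "$input_dir" ]; then
        echo "input directory not found: $input_dir" >&2
        return 1
    fi

    # --keep-sequence preserves word order (needed for topic modeling);
    # --remove-stopwords drops common English function words.
    "$MALLET_HOME/bin/mallet" import-dir \
        --input "$input_dir" \
        --output "$output_file" \
        --keep-sequence \
        --remove-stopwords || return 1

    # Succeed only if the output file exists and is non-empty.
    [ -s "$output_file" ]
}
```

After the function is defined in the current shell, running DRY_RUN=1 import_corpus sample-data/web/en tutorial.mallet from the mallet directory prints the command without executing it, which is a cheap way to check the paths before the real import.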
Training
The next step is to build a topic model with the train-topics command and save the output-state, topic-keys, and topics:
mallet-2.0.6$ bin/mallet train-topics -...