Performing naïve Bayesian classification with MALLET
MALLET is best known as a library for topic modeling, but it also implements a number of other algorithms.
One popular algorithm that MALLET implements is naïve Bayesian classification. If you have documents that are already divided into categories, you can train a classifier to categorize new documents into those same categories. Often, this works surprisingly well.
A common application is spam e-mail detection, which we'll use as our example here.
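To make the idea concrete before turning to MALLET, here is a minimal sketch of naïve Bayesian classification in plain Clojure. It is not MALLET's implementation or API: the names tokenize, train, and classify are hypothetical helpers invented for this illustration, and the sketch assumes a simple bag-of-words model with add-one (Laplace) smoothing and a tiny made-up training set.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (not part of MALLET): split text into lowercase word tokens.
(defn tokenize [text]
  (re-seq #"[a-z']+" (str/lower-case text)))

;; Hypothetical helper: docs is a seq of [label text] pairs.
;; Returns per-label document counts and per-label token frequencies.
(defn train [docs]
  (reduce (fn [model [label text]]
            (-> model
                (update-in [:doc-counts label] (fnil inc 0))
                (update-in [:token-counts label]
                           #(merge-with + (or % {})
                                        (frequencies (tokenize text))))))
          {:doc-counts {} :token-counts {}}
          docs))

;; Hypothetical helper: return the label with the highest log-probability,
;; using add-one smoothing so unseen tokens do not zero out a category.
(defn classify [model text]
  (let [{:keys [doc-counts token-counts]} model
        total-docs (reduce + (vals doc-counts))
        vocab-size (count (set (mapcat keys (vals token-counts))))
        score (fn [label]
                (let [counts (token-counts label)
                      total  (reduce + (vals counts))]
                  (reduce +
                          ;; log prior: share of training documents with this label
                          (Math/log (/ (doc-counts label) (double total-docs)))
                          ;; log likelihood of each token given the label
                          (for [t (tokenize text)]
                            (Math/log (/ (+ 1.0 (get counts t 0))
                                         (+ total vocab-size)))))))]
    (apply max-key score (keys doc-counts))))

;; Made-up training data for illustration only.
(def model
  (train [[:spam "win cash now"]
          [:spam "cheap cash prizes"]
          [:ham  "meeting notes attached"]
          [:ham  "lunch meeting tomorrow"]]))

(classify model "win cash prizes")
;; => :spam
```

Summing log-probabilities rather than multiplying raw probabilities avoids numeric underflow on longer documents. A real classifier such as MALLET's adds a proper processing pipeline on top of this basic idea.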
Getting ready
We'll need to have MALLET included in our project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cc.mallet/mallet "2.0.7"]])
Just as in the Performing topic modeling with MALLET recipe, the list of classes to be included is a little long, but most of them are for the processing pipeline, as shown here:
(require '[clojure.java...