Let's jump into the data. The LingSpam corpus comes in four variants: bare, lemm, lemm_stop, and stop. Each variant has ten parts, and each part contains multiple files, one per email. Files whose names start with the spmsg prefix are spam; the rest are ham. An example email from the bare variant looks as follows:
Subject: re : 2 . 882 s - > np np
> date : sun , 15 dec 91 02 : 25 : 02 est > from : michael < mmorse @ vm1 . yorku . ca > > subject : re : 2 . 864 queries > > wlodek zadrozny asks if there is " anything interesting " to be said > about the construction " s > np np " . . . second , > and very much related : might we consider the construction to be a form > of what has been discussed on this list of late as reduplication ? the > logical...
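To make the layout concrete, here is a minimal sketch of reading one variant into memory and labeling each email by its filename. The function name `load_variant` and the `latin-1` encoding are assumptions made for illustration, not part of the corpus documentation. The code walks every file under the variant directory instead of assuming particular names for the part folders.

```python
from pathlib import Path


def load_variant(root, variant="bare"):
    """Return (texts, labels) for one variant; label 1 = spam, 0 = ham.

    `root` is the directory that holds the variant folders
    (bare, lemm, lemm_stop, stop).
    """
    texts, labels = [], []
    variant_dir = Path(root) / variant
    # Walk all parts recursively; every regular file is one email.
    for path in sorted(variant_dir.rglob("*")):
        if not path.is_file():
            continue
        # latin-1 is an assumption: it decodes any byte sequence without error.
        texts.append(path.read_text(encoding="latin-1"))
        # Spam files are the ones whose names start with "spmsg".
        labels.append(1 if path.name.startswith("spmsg") else 0)
    return texts, labels
```

Calling `load_variant(corpus_root, "bare")`, where `corpus_root` is wherever you unpacked the corpus, returns two parallel lists: the raw email texts and their 0/1 labels.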