Spam detection is a common classification problem. In the following recipe, we have a corpus of raw text documents, each labeled as spam or not spam (ham). The data source here is the SMS Spam Collection v.1, a public set of labeled SMS messages collected for mobile phone spam research.
The dataset can be downloaded from http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. The following table lists the provided dataset in different file formats, the number of samples in each class, and the total number of samples:
Application | File format | # Spam | # Ham | Total | Link |
---|---|---|---|---|---|
General | Plain text | 747 | 4,827 | 5,574 | http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip |
Weka | ARFF | 747 | 4,827 | 5,574 | http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsSpamCollection.arff |
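
To make the class counts in the table concrete, here is a minimal sketch of how the plain-text version might be parsed. It assumes each line has the form `label<TAB>message`, with the label being `ham` or `spam`; that layout is an assumption to check against the downloaded file. The function names `parse_sms_lines` and `class_counts` are hypothetical helpers, and the sample messages are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_sms_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each 'label<TAB>message' line into a (label, message) pair.

    Assumes a tab-separated layout; lines without a tab are skipped.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, sep, message = line.partition("\t")
        if not sep:
            continue  # no tab found, so the line does not match the assumed layout
        records.append((label, message))
    return records


def class_counts(records: List[Tuple[str, str]]) -> Counter:
    """Count how many messages carry each label."""
    return Counter(label for label, _ in records)


# Made-up sample lines standing in for the real file contents.
sample = [
    "ham\tSee you at lunch tomorrow\n",
    "spam\tYou have won a prize, reply now to claim\n",
    "ham\tRunning late, start without me\n",
]

records = parse_sms_lines(sample)
counts = class_counts(records)
print(counts["ham"], counts["spam"])
```

Because `parse_sms_lines` accepts any iterable of strings, an open file handle for the extracted text file can be passed in place of `sample`. If the tab-separated assumption holds for the real data, the counts should match the 747 spam and 4,827 ham messages listed in the table.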