E-mail subject line tester
An e-mail subject line tester is a simple program that determines whether a given e-mail subject line is spam. In this chapter, we will program a Naïve Bayes classifier from scratch and use it to classify subject lines as spam or not spam with very simple code. To do this, we break each subject line into a list of relevant words, which serve as the feature vector for the algorithm. We will use the SpamAssassin public dataset, which includes three categories: spam, easy ham, and hard ham. In this case, we will create a binary classifier with two classes: spam and not spam (easy ham).
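To make the idea concrete before we work with the real data, here is a minimal sketch of a two-class Naïve Bayes classifier over word lists. It is not the implementation developed in this chapter: the function names train and classify, the label names spam and ham, the add-one (Laplace) smoothing, and the made-up toy subject lines are all assumptions chosen for illustration. It also assumes that each class has at least one training example.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")  # assumed label names; "ham" stands for not spam


def train(labelled_subjects):
    """Count words per class and subject lines per class.

    labelled_subjects is an iterable of (list_of_words, label) pairs.
    """
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for words, label in labelled_subjects:
        doc_counts[label] += 1
        word_counts[label].update(words)
    return word_counts, doc_counts


def classify(words, word_counts, doc_counts):
    """Return the label with the highest log-probability score."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in LABELS:
        total_words = sum(word_counts[label].values())
        # Start from the log prior of the class.
        score = math.log(doc_counts[label] / total_docs)
        for word in words:
            # Add-one smoothing (an assumption) avoids log(0) for unseen words.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up toy data, for illustration only.
toy_training = [
    (["free", "money"], "spam"),
    (["cheap", "money", "offer"], "spam"),
    (["meeting", "agenda"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
toy_word_counts, toy_doc_counts = train(toy_training)
```

Working with sums of logarithms rather than products of probabilities keeps the scores from underflowing when a subject line contains many words.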
There are several features that we could use for our classifier, such as the precedence, the language, and the use of uppercase. We will keep things simple and use only word frequency when training the algorithm, counting only words of more than three characters, which avoids words such as The or RT.
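The word filter described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names tokenize and word_frequencies are hypothetical, and lower-casing the text and splitting on runs of ASCII letters are choices made here, not something the dataset requires.

```python
import re
from collections import Counter

MIN_WORD_LENGTH = 4  # keep only words with more than three characters


def tokenize(subject):
    """Lower-case a subject line and return its words longer than three characters."""
    # Assumption: a "word" is a run of ASCII letters; digits and punctuation are dropped.
    words = re.findall(r"[a-z]+", subject.lower())
    return [word for word in words if len(word) >= MIN_WORD_LENGTH]


def word_frequencies(subjects):
    """Count how often each retained word occurs across a list of subject lines."""
    counts = Counter()
    for subject in subjects:
        counts.update(tokenize(subject))
    return counts
```

For example, tokenize("RT: The FREE offer ends today") drops RT and The and keeps free, offer, ends, and today.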
We will implement the...