Spam, or electronic spam, refers to unsolicited messages that typically carry advertising content, infected attachments, or links to phishing or malware sites. While email spam is the most widely recognized form, spam also appears in other media: website comments, instant messaging, internet forums, blogs, online ads, and so on.
In this chapter, we will discuss how to build a naive Bayes spam filter that uses the bag-of-words (BoW) representation to identify spam emails. Naive Bayes filtering is one of the basic techniques implemented in the first commercial spam filters; for instance, the Mozilla Thunderbird mail client uses a native implementation of such filtering. While the example in this chapter uses email spam, the underlying methodology can be applied to other types of text-based spam as well.
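To make the idea concrete before going into detail, here is a minimal sketch of a naive Bayes classifier over bag-of-words counts. It is not the implementation developed in this chapter: the function names (`tokenize`, `train`, `classify`), the whitespace tokenizer, the add-one (Laplace) smoothing, and the four-message toy dataset are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy dataset: (message text, label) pairs.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]


def tokenize(text):
    """Split a message into lowercase, whitespace-separated tokens."""
    return text.lower().split()


def train(messages):
    """Count words per label (the bag of words) and messages per label."""
    word_counts = {}        # label -> Counter of token frequencies
    doc_counts = Counter()  # label -> number of training messages
    vocab = set()           # every token seen during training
    for text, label in messages:
        doc_counts[label] += 1
        bag = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            bag[token] += 1
            vocab.add(token)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Log prior: fraction of training messages carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore tokens never seen in training
            # Add-one smoothing keeps unseen (token, label) pairs above zero.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("free money", model))
print(classify("monday meeting", model))
```

The "naive" part is the assumption that tokens are conditionally independent given the label, which is why the per-token log-likelihoods can simply be summed. Working in log space avoids numeric underflow when messages contain many tokens.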