Project overview – problem formulation
In this chapter, the goal is to build a spam classifier, one that is capable of distinguishing spam terms from the regular, expected content they are mixed in with in email messages. Spam messages are email messages that are sent with the same content to multiple recipients, as opposed to regular messages. We start with two email datasets, one that represents ham and one that represents spam. After several stages of preprocessing, we fit the model on a training set, say 70% of the entire dataset.
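To make the training split concrete, here is a minimal sketch in plain Python. It assumes the ham and spam datasets have already been loaded as labeled (text, label) pairs. The function name `train_test_split`, the seed, and the toy messages are hypothetical and used only for illustration; they are not the chapter's actual data or API.

```python
import random

def train_test_split(messages, train_fraction=0.7, seed=42):
    """Shuffle labeled messages and split them into training and test sets."""
    shuffled = list(messages)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy data standing in for the two email datasets.
ham = [(f"meeting notes for week {i}", "ham") for i in range(5)]
spam = [(f"win free prize number {i}", "spam") for i in range(5)]

train, test = train_test_split(ham + spam)
print(len(train), "training messages,", len(test), "held-out messages")
```

Shuffling before cutting matters here because the two datasets are concatenated; without it, the training set would contain mostly one class.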
This application is a typical spam filtering application in the sense that it works on text. We then put algorithms to work that help the ML process detect the words, phrases, and terms most likely to be found in spam emails. Next, we will go over the ML workflow at a high level in relation to spam filtering.
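As a rough illustration of what detecting terms most likely found in spam can mean, the sketch below scores each term by how much more often it appears in spam messages than in ham messages. This is a simple smoothed frequency estimate that assumes equal class priors, not necessarily the algorithm used later in the chapter. The function names and sample messages are hypothetical.

```python
from collections import Counter

def term_counts(messages):
    """Count how many messages each lowercase token appears in."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))  # set(): count a term once per message
    return counts

def spam_scores(ham, spam):
    """Estimate P(spam | term) with add-one smoothing and equal class priors."""
    ham_counts, spam_counts = term_counts(ham), term_counts(spam)
    scores = {}
    for term in set(ham_counts) | set(spam_counts):
        p_term_spam = (spam_counts[term] + 1) / (len(spam) + 2)
        p_term_ham = (ham_counts[term] + 1) / (len(ham) + 2)
        scores[term] = p_term_spam / (p_term_spam + p_term_ham)
    return scores

# Hypothetical sample messages, not the chapter's datasets.
ham = ["see you at the meeting tomorrow", "please review the attached report"]
spam = ["win a free prize today", "free offer claim your prize"]

scores = spam_scores(ham, spam)
for term in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(term, round(scores[term], 2))
```

Terms with a score well above 0.5 lean toward spam, and terms well below 0.5 lean toward ham. The smoothing keeps a term seen in only one class from receiving a score of exactly 0 or 1.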
The ML workflow is as follows:
- We will be developing a pipeline that will use dataframes
- A dataframe contains a predictions column...