We hope it was easy to go through this chapter. Now, as usual, it is practice time. Your job is to try building your own spam detection system. We will guide you through the questions.
In this chapter's GitHub repository, you will find a dataset collected for the research by I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, G. Paliouras, and C.D. Spyropoulos: An Evaluation of Naive Bayesian Anti-Spam Filtering. Proceedings of the Workshop on Machine Learning in the New Information Age, G. Potamias, V. Moustakis, and M. van Someren (eds.), 11th European Conference on Machine Learning, Barcelona, Spain, pp. 9-17, 2000.
You can now prepare the data:
- Perform the following text-cleaning tasks:
  - Remove stopwords, digits, and punctuation marks from your texts.
  - Perform lemmatization.
  - Create a word dictionary that records the frequency of each word.
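The cleaning tasks above can be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library, not a reference solution: the `STOPWORDS` set is a deliberately tiny, hypothetical list, `naive_lemma` is a crude stand-in for real lemmatization (it only strips a trailing plural "s"), and the two sample emails are invented for the example. In your own solution you would most likely swap in a proper stopword list and a real lemmatizer, such as NLTK's `WordNetLemmatizer`.

```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one is much larger.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you", "your"}

# Translation table that maps every ASCII punctuation mark to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def naive_lemma(word):
    """Crude stand-in for a lemmatizer: strips a trailing plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean_text(text):
    """Lowercase, drop digits and punctuation, remove stopwords, lemmatize."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # remove digits
    text = text.translate(PUNCT_TABLE)    # remove punctuation marks
    tokens = text.split()
    return [naive_lemma(t) for t in tokens if t not in STOPWORDS]


def build_word_dictionary(emails):
    """Count how often each cleaned word appears across all emails."""
    counts = Counter()
    for email in emails:
        counts.update(clean_text(email))
    return counts


# Invented sample emails, used only to exercise the functions.
emails = [
    "Win 1000 prizes now!!! Claim your prizes today.",
    "The meeting is moved to 3 pm; bring the reports.",
]
word_counts = build_word_dictionary(emails)
print(word_counts.most_common(3))
```

A `Counter` is used here because it behaves like a dictionary and also offers `most_common`, which is handy later if you want to keep only the most frequent words as features.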
In email texts, you...