Introducing MLlib – Spam classification
Let's introduce MLlib with a concrete example. We will look at spam classification using the Ling-Spam dataset that we used in Chapter 10, Distributed Batch Processing with Spark. We will create a spam filter that uses logistic regression to estimate the probability that a given message is spam.
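To make the idea of "estimating a probability" concrete, here is a minimal sketch in plain Scala of how a logistic regression model turns a feature vector into a spam probability. The object name `LogisticSketch`, the method `spamProbability`, and the weights in the usage note are hypothetical and chosen only for illustration. This is not MLlib's API, and in practice the weights are learned from the training data.

```scala
// Illustrative sketch only: not part of MLlib, and no values here are
// learned from the Ling-Spam data.
object LogisticSketch {

  // Logistic (sigmoid) function: maps any real number into the interval (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability that a message is spam, given its feature vector.
  // Computes sigmoid(weights . features + intercept).
  def spamProbability(
      weights: Array[Double],
      intercept: Double,
      features: Array[Double]): Double = {
    require(weights.length == features.length,
      "weights and features must have the same length")
    val z = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
    sigmoid(z)
  }
}
```

For instance, with made-up weights `Array(1.0, -1.0)`, a zero intercept, and features `Array(2.0, 2.0)`, the weighted sum is zero, so `LogisticSketch.spamProbability` returns 0.5: the model is undecided. A larger weighted sum pushes the probability towards 1 (spam), a smaller one towards 0.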
We will run through examples using the Spark shell, but you will find an analogous program in LogisticRegressionDemo.scala among the examples for this chapter. If you have not installed Spark, refer to Chapter 10, Distributed Batch Processing with Spark, for installation instructions.
Let's start by loading the e-mails in the Ling-Spam dataset. If you have not already downloaded the data for Chapter 10, Distributed Batch Processing with Spark, download it from data.scala4datascience.com/ling-spam.tar.gz or data.scala4datascience.com/ling-spam.zip, depending on whether you want a tar.gz file or a zip file, and unpack the archive. This will create a...