An example of machine learning algorithms
This section shows an example of building a machine learning application for spam detection using the RDD-based API. The next section shows an example based on the DataFrame-based API.
Logistic regression for spam detection
Let's use two algorithms to build a spam classifier:
HashingTF: to build term frequency feature vectors from the text of spam and ham e-mails
LogisticRegressionWithSGD: to build a model that separates the two types of messages, spam and ham
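To make the roles of these two algorithms concrete, here is a minimal pure-Python sketch of the ideas behind them. It is not Spark's implementation and does not use the MLlib API. The function names, the vector size of 4096, the step size, the iteration count, and the two ham messages are all assumptions made for illustration; the spam messages are taken from the sample file used in this section.

```python
import math
import zlib

NUM_FEATURES = 4096  # hypothetical vector size, chosen for this sketch


def hashing_tf(text, num_features=NUM_FEATURES):
    """Return a sparse term-frequency vector as {bucket index: count}."""
    vector = {}
    for term in text.lower().split():
        # crc32 is stable across runs; the built-in hash() of a str is not.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


def score(weights, features):
    """Dot product of the weight map and a sparse feature vector."""
    return sum(weights.get(i, 0.0) * x for i, x in features.items())


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(examples, step=0.5, iterations=100):
    """Fit weights (no intercept) by stochastic gradient descent on log loss."""
    weights = {}
    for _ in range(iterations):
        for label, features in examples:
            error = sigmoid(score(weights, features)) - label
            for i, x in features.items():
                weights[i] = weights.get(i, 0.0) - step * error * x
    return weights


def predict(weights, text):
    """Return 1 (spam) or 0 (ham) for a raw message."""
    return 1 if sigmoid(score(weights, hashing_tf(text))) >= 0.5 else 0


spam = ["$$$ Send money", "100% free", "Send SSN and password"]
# Made-up ham messages, used only for this illustration.
ham = ["See you at lunch tomorrow", "Please review the attached report"]
training = [(1.0, hashing_tf(m)) for m in spam] + [(0.0, hashing_tf(m)) for m in ham]
weights = train_logistic_sgd(training)

for message in ["Send money", "Please review the report"]:
    print(message, "->", predict(weights, message))
```

The sketch hashes each term into a fixed number of buckets, so no vocabulary has to be built first, and then learns one weight per bucket. Because it has no intercept, a message made up entirely of unseen terms scores 0 and is labeled spam, which is a limitation of the sketch rather than a statement about MLlib.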
As you learned how to use notebooks in Chapter 6, Notebooks and Dataflows with Spark and Hadoop, you can execute the following code in an IPython or Zeppelin notebook. You can also execute it from the command line.
First of all, let's create some sample spam and ham e-mails:
[cloudera@quickstart ~]$ cat spam_messages.txt
$$$ Send money
100% free
Amazing stuff
Home based
Reverses aging
No investment
Send SSN and password
[cloudera@quickstart ~]$ cat ham_messages.txt
Thank you for...