Classification with Naive Bayes
This section provides a working example of the Apache Spark MLlib Naive Bayes algorithm. It describes the theory behind the algorithm and gives a step-by-step example in Scala to show how it may be used.
Theory on Classification
In order to use the Naive Bayes algorithm to classify a dataset, the data must be linearly separable; that is, the classes within the data must be separable by linear class boundaries. The following figure illustrates this with three datasets and two class boundaries, shown as dotted lines:
Naive Bayes assumes that the features (or dimensions) within a dataset are independent of one another given the class; that is, once the class is known, they have no effect on each other. The following example considers the classification of e-mails as spam. Suppose you have 100 e-mails, broken down as follows:
60% of emails are spam
80% of spam emails contain the word buy
20% of spam emails don't contain the word buy
40% of emails are not spam
10% of non...