Support Vector Machine (SVM) with Spark 2.0
In this recipe, we use SVMWithSGD, Spark's RDD-based SVM API trained with stochastic gradient descent (SGD), to classify the population into two classes, and then use count and BinaryClassificationMetrics to assess model performance.
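To make this workflow concrete without a Spark installation, here is a minimal pure-Python sketch of the idea behind it. It fits a linear SVM by stochastic subgradient descent on the L2-regularized hinge loss, then evaluates it by counting correct predictions and computing the area under the ROC curve. This is not Spark's implementation. The function names (train_linear_svm, score, predict, area_under_roc), the hyperparameter defaults, and the eight toy points are all assumptions made for illustration.

```python
import math
import random


def score(w, x):
    """Raw margin w . x for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict(w, x, threshold=0.0):
    """Class 1 when the raw score exceeds the threshold, otherwise class 0."""
    return 1 if score(w, x) > threshold else 0


def train_linear_svm(points, num_iterations=200, step_size=1.0,
                     reg_param=0.01, seed=42):
    """Illustrative SGD on the hinge loss; points is a list of (label, features)."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0][1])
    for t in range(1, num_iterations + 1):
        label, x = rng.choice(points)      # one random example per step
        y = 2.0 * label - 1.0              # map labels {0, 1} to {-1, +1}
        margin = y * score(w, x)
        eta = step_size / math.sqrt(t)     # step size decays with the iteration
        for i in range(len(w)):
            grad = reg_param * w[i]        # gradient of the L2 penalty
            if margin < 1.0:               # hinge loss is active for this example
                grad -= y * x[i]
            w[i] -= eta * grad
    return w


def area_under_roc(scores_and_labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, label in scores_and_labels if label == 1]
    neg = [s for s, label in scores_and_labels if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two well-separated clusters, labels 0 and 1.
data = [
    (1, [2.0, 1.0]), (1, [1.0, 2.0]), (1, [3.0, 3.0]), (1, [2.0, 2.5]),
    (0, [-2.0, -1.0]), (0, [-1.0, -2.0]), (0, [-3.0, -3.0]), (0, [-2.0, -2.5]),
]

weights = train_linear_svm(data)

# Evaluated on the training points only to keep the sketch short;
# a real experiment would hold out separate test data.
correct = sum(1 for label, x in data if predict(weights, x) == label)
scored = [(score(weights, x), label) for label, x in data]
auc = area_under_roc(scored)
print(f"correct: {correct} of {len(data)}, AUC: {auc}")
```

In Spark itself, the corresponding steps are training with SVMWithSGD and passing (score, label) pairs to BinaryClassificationMetrics.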
In the interest of time and space, we use the sample data in LIBSVM format supplied with Spark, but provide links to additional data files offered by National Taiwan University so that readers can experiment on their own.
As a concept, Support Vector Machine (SVM) is fundamentally simple; the complexity lies in the details of its implementation in Spark or any other package.
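For readers who have not seen the LIBSVM format before, each line of such a file holds one example: a label followed by sparse index:value pairs, with indices starting at 1. The sketch below parses one such line in plain Python. The function name parse_libsvm_line and the sample line are made up for illustration, and the shift to zero-based indices mirrors what Spark's loader does when it reads these files.

```python
def parse_libsvm_line(line):
    """Parse 'label index:value ...' into (label, {zero_based_index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index) - 1] = float(value)  # LIBSVM indices are 1-based
    return label, features


# Hypothetical line, not taken from the sample file shipped with Spark.
example = "1 3:0.5 7:1.25 10:2"
label, features = parse_libsvm_line(example)
print(label, features)
```

Only the nonzero features are stored, which is why the format suits wide, sparse datasets.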
While the mathematics behind SVM is beyond the scope of this book, readers are encouraged to read the following tutorials and the original SVM paper for a deeper understanding.
The original works are by Vapnik and Chervonenkis (1974 and 1979, in Russian); Vapnik's 1979 book is also available in a 1982 English translation:
https://www.amazon.com/Statistical-Learning-Theory-Vladimir-Vapnik...