Machine learning with SparkR
Spark version 1.5 added support for machine learning over DataFrames created in SparkR. As of version 2.0, SparkR supports the Generalized Linear Model, Accelerated Failure Time (AFT) Survival Regression Model, Naive Bayes Model, and K-Means algorithms.
Let's go through a couple of examples to understand how machine learning is implemented in SparkR.
Using the Naive Bayes model
Based on the Titanic survival dataset, let's analyze what sorts of people are likely to survive. The Titanic dataset is summarized according to economic status (class), sex, age, and survival. spark.naiveBayes() fits a Bernoulli Naive Bayes model against a Spark DataFrame. The steps to do so are as follows:
- Create a local DataFrame and convert it to a Spark DataFrame:
> localDF <- as.data.frame(Titanic)
> DF <- createDataFrame(localDF[localDF$Freq > 0, -5])
> head(DF)
  Class    Sex   Age Survived
1   3rd   Male Child       No
2   3rd Female Child ...
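The subsetting expression passed to createDataFrame() does two things at once, and it may help to try it in plain R before involving Spark. The following is a minimal sketch that runs without a Spark session; the variable name prepared is introduced here purely for illustration and does not appear in the original steps.

```r
# Titanic ships with base R (datasets package) as a 4-dimensional table.
# as.data.frame() flattens it to one row per combination of
# Class, Sex, Age, and Survived, plus a Freq column holding the count.
localDF <- as.data.frame(Titanic)
str(localDF)

# Row filter: keep only the combinations that actually occurred (Freq > 0).
# Column filter: -5 drops the fifth column, which is Freq.
# "prepared" is a hypothetical name used only in this sketch.
prepared <- localDF[localDF$Freq > 0, -5]

# Only the four categorical columns remain, ready to be handed to
# createDataFrame() as in the step above.
print(names(prepared))
print(head(prepared))
```

Note that dropping Freq means each observed combination contributes a single row, regardless of how many passengers it originally represented.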