Machine learning
SparkR provides wrappers around existing MLlib functions. R formulas are implemented as MLlib feature transformers. A transformer is an ML pipeline (spark.ml) stage that takes a DataFrame as input and produces another DataFrame as output, generally with some columns appended. Feature transformers are a type of transformer that converts input columns to feature vectors and appends these vectors to the source DataFrame. For example, in linear regression, string input columns are one-hot encoded and numeric values are converted to doubles. A label column is appended (if the DataFrame does not already have one) as a copy of the response variable.
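To make the transformation concrete, here is a minimal sketch in plain R rather than SparkR. It only imitates the effect described above by using base R's model.matrix on a local data frame; it is not how MLlib implements it, and Spark's choice of reference category and its vector layout may differ. The toy data and the f_ prefix on the feature columns are assumptions made purely for illustration.

```r
# Toy data (hypothetical): integer response, integer predictor, string predictor.
df <- data.frame(
  mpg  = c(21L, 30L, 18L, 25L),
  hp   = c(110L, 65L, 175L, 95L),
  fuel = c("petrol", "diesel", "petrol", "hybrid"),
  stringsAsFactors = TRUE
)

# Expand the formula mpg ~ hp + fuel into a numeric design matrix.
# The string column becomes 0/1 indicator columns (one level is the baseline)
# and the numeric column is stored as doubles. The intercept column is dropped.
features <- model.matrix(mpg ~ hp + fuel, data = df)[, -1, drop = FALSE]
colnames(features) <- paste0("f_", colnames(features))

# Append the feature columns and a label column that copies the response.
out <- cbind(df, features, label = as.double(df$mpg))
print(out)
```

The original columns are left untouched and the new ones are appended, which mirrors the DataFrame-in, DataFrame-out contract of a transformer.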
In this section, we cover example code for the Naive Bayes and Gaussian GLM models. We do not explain the models themselves or the summaries they produce; instead, we go straight to how they can be run using SparkR.
The Naive Bayes model
The Naïve Bayes model is an intuitively simple model that works with categorical...