In this section, we compare successive versions of MLlib and the new functionality added in each release.
MLlib versions compared
Spark 1.6 to 2.0
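As of Spark 2.0, the DataFrame-based API (the spark.ml package) is the primary API for MLlib.
The RDD-based API (the spark.mllib package) is entering maintenance mode: it continues to receive bug fixes but no new features. The MLlib guide (http://spark.apache.org/docs/2.0.0/ml-guide.html) provides more details.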
The following are the new features introduced in Spark 2.0:
- ML persistence: The DataFrame-based API provides support for saving and loading ML models and Pipelines in Scala, Java, Python, and R
- MLlib in R: SparkR offers MLlib APIs for generalized linear models, naive Bayes, k-means clustering, and survival regression in this release
- Python: PySpark in 2.0 supports new MLlib algorithms, including LDA, Generalized Linear Regression, Gaussian Mixture...