MLlib and the Pipeline API
Let us first cover some Spark fundamentals so that we can perform machine learning operations on it. This section discusses MLlib and the Pipeline API.
MLlib
MLlib is the machine learning library built on top of Apache Spark; it houses most of the algorithms that can be implemented at scale. Its seamless integration with other Spark components such as GraphX, SQL, and Streaming lets developers assemble complex, scalable, and efficient workflows relatively easily. The library consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
MLlib works in conjunction with the spark.ml package, which provides a high-level Pipeline API. The fundamental difference between these two packages is that MLlib (spark.mllib) works on top of RDDs, whereas the ML (spark.ml) package works on top of DataFrames and supports ML Pipelines. Currently...