Working with Spark ML pipelines
Spark MLlib's goal is to make practical ML scalable and easy. Like Spark Core, MLlib provides APIs in three languages (Python, Scala, and Java), with example code that eases the learning curve for users coming from different backgrounds. The pipeline API in MLlib provides a uniform set of high-level APIs, built on top of DataFrames, that help users create and tune practical ML pipelines. This API lives in a new package named spark.ml.
MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline or workflow. Let's see the key terms introduced by the pipeline API:
- DataFrame: The ML API uses the DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. For example, a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
- Transformer: A transformer is an algorithm which can transform one DataFrame...