Building machine learning pipelines
Spark ML is an API built on top of the DataFrames API of Spark SQL for constructing machine learning pipelines. It is inspired by the scikit-learn project and, like scikit-learn, makes it easier to combine multiple algorithms into a single pipeline. ML pipelines use the following concepts:
DataFrame: A DataFrame organizes data into rows and named columns, just like an RDBMS table. Its columns can hold text, feature vectors, true labels, and predictions.
Transformer: A Transformer is an algorithm that transforms one DataFrame into another. An ML model is an example of a Transformer: it transforms a DataFrame with features into a DataFrame with predictions.
Estimator: An Estimator is an algorithm that produces a Transformer by fitting on a DataFrame. A learning algorithm that trains on a DataFrame and generates a model is an example of an Estimator.
Pipeline: As the name indicates, a pipeline creates a workflow by chaining multiple Transformers and Estimators together.
Parameter: This is an API to...
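To make the relationship between these concepts concrete, the following is a minimal sketch in plain Python rather than Spark itself. It assumes a list of dictionaries as a stand-in for a DataFrame, and every class in it (Tokenizer, KeywordClassifier, KeywordModel, Pipeline, PipelineModel) is a hypothetical toy written for illustration. Only the fit() and transform() method names mirror the contract described above; the keyword-matching "learning" rule is an invented placeholder, not an algorithm that Spark ML provides.

```python
# Toy illustration of the Transformer / Estimator / Pipeline contract.
# This is NOT Spark's API: all classes are hypothetical, and a list of
# dicts stands in for a DataFrame.


class Tokenizer:
    """Transformer: adds a 'words' column derived from the 'text' column."""

    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class KeywordModel:
    """Transformer produced by fitting: adds a 'prediction' column."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            {**row, "prediction": 1.0 if self.keywords & set(row["words"]) else 0.0}
            for row in rows
        ]


class KeywordClassifier:
    """Estimator: learns from labeled rows and returns a fitted model."""

    def fit(self, rows):
        positive, negative = set(), set()
        for row in rows:
            (positive if row["label"] == 1.0 else negative).update(row["words"])
        # Keep only the words seen exclusively in positive examples.
        return KeywordModel(positive - negative)


class PipelineModel:
    """Fitted pipeline: holds only Transformers and applies them in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; fit() replaces each Estimator with its fitted model."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # Estimator -> Transformer
            rows = stage.transform(rows)  # output feeds the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


training = [
    {"text": "spark makes pipelines easy", "label": 1.0},
    {"text": "hadoop mapreduce job", "label": 0.0},
]
model = Pipeline([Tokenizer(), KeywordClassifier()]).fit(training)
predictions = model.transform([{"text": "Spark job"}, {"text": "mapreduce job"}])
```

Fitting the pipeline returns an object that is itself a Transformer, so the same chain of stages that processed the training data is applied unchanged to new data. In Spark the corresponding building blocks live in the pyspark.ml package (for example Pipeline, Tokenizer, and LogisticRegression) and operate on real DataFrames instead of lists.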