How does it work?
A pipeline is a sequence of stages, where each stage is either a Transformer or an Estimator. The stages run in order, and the input DataFrame is transformed as it passes through each stage of the process:
- Transformer stages: the transform() method is called on the DataFrame
- Estimator stages: the fit() method is called on the DataFrame
A pipeline is created by declaring its stages, configuring the appropriate parameters, and then chaining the stages together in a pipeline object. For example, to create a simple classification pipeline we would tokenize the text into words, use the hashing term frequency feature extractor to extract features, and then build a logistic regression model.
Tip
Ensure that the Apache Spark ML JAR is either on the classpath or included as a dependency when you do the initial build.
Scala syntax - building a pipeline
This pipeline can be built as follows using the Scala API:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg...
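The snippet above is truncated, so here is a minimal sketch of the full pipeline it describes: tokenizer, hashing term frequency feature extractor, and logistic regression, chained into a Pipeline. The training data, column names, and parameter values (maxIter, regParam) are illustrative assumptions, not taken from the original text:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object SimpleClassificationPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PipelineExample").getOrCreate()

    // Illustrative training data: (id, text, label)
    val training = spark.createDataFrame(Seq(
      (0L, "spark is great", 1.0),
      (1L, "hadoop mapreduce", 0.0),
      (2L, "spark ml pipelines", 1.0),
      (3L, "legacy batch jobs", 0.0)
    )).toDF("id", "text", "label")

    // Transformer stage: split the text column into words
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")

    // Transformer stage: hash the words into feature vectors
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")

    // Estimator stage: logistic regression (parameter values are illustrative)
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.001)

    // Chain the stages into a single pipeline object
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // fit() runs the stages in order and produces a PipelineModel
    val model = pipeline.fit(training)

    // The fitted model can then transform new DataFrames end to end
    model.transform(training).select("id", "text", "prediction").show()

    spark.stop()
  }
}
```

Calling fit() on the pipeline invokes transform() on each Transformer stage and fit() on each Estimator stage in sequence, which mirrors the stage behavior listed above.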