Spark provides pipeline APIs under Spark ML. A pipeline comprises a sequence of stages, each of which is one of two basic types: a transformer or an estimator.
- A transformer takes a dataset as input and produces an augmented dataset as output, so that the output can be fed to the next stage. For example, Tokenizer and HashingTF are two transformers. Tokenizer transforms a dataset with a text column into a dataset with tokenized words, while HashingTF maps those words to term-frequency feature vectors. Tokenization and term-frequency hashing are commonly used in text mining and text analytics.
- An estimator, in contrast, must first be fit on the input dataset to produce a model. The resulting model then acts as a transformer, transforming the input dataset into the augmented output dataset.
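The stages described above can be sketched in Scala using the standard Spark ML API. This is a minimal illustration, not a complete application: the two-row training dataset and column names are toy assumptions, while Tokenizer, HashingTF, LogisticRegression, and Pipeline are the actual Spark ML classes.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("PipelineSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy labeled text data (hypothetical example input).
    val training = Seq(
      (0L, "spark makes big data simple", 1.0),
      (1L, "long queues are frustrating", 0.0)
    ).toDF("id", "text", "label")

    // Transformer: text column -> tokenized words column.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")

    // Transformer: words -> term-frequency feature vectors.
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")

    // Estimator: fitting it on the features produces a LogisticRegressionModel.
    val lr = new LogisticRegression().setMaxIter(10)

    // Chain the stages into a single pipeline.
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // fit() runs the transformers, fits the estimator, and returns a
    // PipelineModel, which is itself a transformer.
    val model = pipeline.fit(training)
    model.transform(training).select("id", "prediction").show()

    spark.stop()
  }
}
```

Note the asymmetry the text describes: `tokenizer.transform(...)` and `hashingTF.transform(...)` can be called directly, but `lr` must go through `fit()`, and only the returned model exposes `transform()`.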