The concept of pipelines
Apache SparkML pipelines have the following components:
- DataFrame: This is the central data store; all original data and intermediate results are stored in DataFrames.
- Transformer: As the name suggests, a transformer turns one DataFrame into another, in most cases by appending additional (feature) columns. Transformers are stateless: they have no internal memory and behave exactly the same each time they are used, a concept you may be familiar with from the map function on RDDs.
- Estimator: In most cases, an estimator is some sort of machine learning model. In contrast to a transformer, an estimator holds internal state that depends on the data it has already seen: calling fit() on an estimator trains it on a DataFrame and returns a model, which is itself a transformer.
- Pipeline: This is the glue that joins the preceding components (DataFrame, Transformer, and Estimator) together.
- Parameter: Machine learning algorithms have many knobs to tweak. These are called hyperparameters...
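To make the division of labor between these components concrete, here is a minimal pure-Python sketch of the fit/transform contract. This is not actual Spark code: the "DataFrame" is simplified to a list of dicts, and the `Tokenizer`, `MeanLengthEstimator`, and `MeanLengthModel` classes are illustrative stand-ins that merely mirror SparkML's naming convention.

```python
class Tokenizer:
    """Transformer: stateless, adds a 'words' column to every row."""
    def transform(self, df):
        return [{**row, "words": row["text"].split()} for row in df]

class MeanLengthModel:
    """Model produced by the estimator; carries the learned state (the mean)."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, df):
        # Flag rows whose word count exceeds the mean learned during fit().
        return [{**row, "long": len(row["words"]) > self.mean} for row in df]

class MeanLengthEstimator:
    """Estimator: fit() inspects the data and returns a fitted model,
    which is itself a transformer."""
    def fit(self, df):
        lengths = [len(row["words"]) for row in df]
        return MeanLengthModel(sum(lengths) / len(lengths))

class PipelineModel:
    """Result of fitting a pipeline: a pure chain of transformers."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, df):
        for stage in self.stages:
            df = stage.transform(df)
        return df

class Pipeline:
    """Glue: runs each stage in order, calling fit() on estimators
    and transform() on transformers, and returns a PipelineModel."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, df):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):   # estimator: learn, then keep the model
                stage = stage.fit(df)
            df = stage.transform(df)    # pass the enriched DataFrame onward
            fitted.append(stage)
        return PipelineModel(fitted)

rows = [{"text": "spark ml pipelines"}, {"text": "hello"}]
model = Pipeline([Tokenizer(), MeanLengthEstimator()]).fit(rows)
out = model.transform(rows)
```

Note how the estimator disappears from the fitted pipeline: `fit()` replaces it with the model it produced, so the resulting `PipelineModel` is a pure chain of stateless transformers that can be applied to new data. Real SparkML behaves the same way, but on distributed `pyspark.sql.DataFrame` objects.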