What does the new API look like?
When it comes to machine learning on Apache Spark, we are used to transforming data into the appropriate format and data types before we actually feed it to our algorithms. Machine learning practitioners have found that the preprocessing tasks in a machine learning project usually follow the same pattern:
- Data preparation
- Training
- Evaluating
- Hyperparameter tuning
Therefore, the new Apache SparkML API supports this process out of the box. It is called pipelines and is inspired by scikit-learn (http://scikit-learn.org), a very popular machine learning library for the Python programming language. The central data structure is a DataFrame, and all operations run on top of it.
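The four steps above map directly onto pipeline stages. The following is a minimal sketch, assuming a DataFrame `df` with two numeric columns, `feature1` and `feature2`, plus a `label` column (all hypothetical names chosen for illustration):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Data preparation: assemble the raw columns into a single feature vector,
// the input format Spark ML estimators expect
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1", "feature2"))
  .setOutputCol("features")

// Training: a logistic regression estimator reading the assembled features
val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setMaxIter(10)

// The pipeline chains the preprocessing and training stages in order
val pipeline = new Pipeline().setStages(Array(assembler, lr))

// Fitting runs every stage on the DataFrame and returns a fitted model;
// transform then appends a prediction column to a (new) DataFrame
val model = pipeline.fit(df)
val predictions = model.transform(df)
```

Evaluation and hyperparameter tuning plug into the same structure: the fitted pipeline can be scored with an evaluator, and the whole pipeline can be passed as the estimator to a cross-validator, so the entire preprocess-train-evaluate loop is tuned as one unit.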