We have explored the world of machine learning and Apache Spark's support for it over the last few chapters. As we discussed, a typical machine learning workflow consists of the following steps:
- Loading or ingesting data.
- Cleansing the data.
- Extracting features from the data.
- Training a model on the data to generate desired outcomes based on features.
- Evaluating or predicting outcomes based on the data (a minimal end-to-end sketch of these steps follows this list).
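To make these steps concrete, the following Scala sketch walks through them once using Spark's DataFrame-based MLlib API. The input path, the column names (`f1`, `f2`, `f3`, `label`), and the choice of logistic regression are assumptions for illustration only, not part of the workflow itself:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

object WorkflowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WorkflowSketch").getOrCreate()

    // 1. Load or ingest the data (hypothetical CSV path and schema).
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/input.csv")

    // 2. Cleanse the data: here, simply drop rows with missing values.
    val cleaned = raw.na.drop()

    // 3. Extract features: assemble numeric columns into one feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3")) // assumed column names
      .setOutputCol("features")
    val featurized = assembler.transform(cleaned)

    // 4. Train a model on a training split of the data.
    val Array(train, test) = featurized.randomSplit(Array(0.8, 0.2), seed = 42)
    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
    val model = lr.fit(train)

    // 5. Evaluate or predict outcomes on the held-out data.
    val predictions = model.transform(test)
    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .evaluate(predictions)
    println(s"Area under ROC = $auc")

    spark.stop()
  }
}
```

Each step here is an independent call; the pipeline abstraction discussed next chains such steps into a single reusable unit.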
A simplified view of a typical pipeline is shown in the following diagram:
Hence, the data may pass through several stages of transformation before the model is trained and subsequently deployed. Moreover, we should expect to refine the features and model parameters over time, and we might even explore a completely different algorithm, repeating the entire sequence of tasks as part of a new workflow.
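To illustrate, Spark ML's Pipeline chains transformation and training stages so that the entire sequence can be refit, retuned, or replaced as one unit. The following Scala sketch uses the well-known Tokenizer/HashingTF/LogisticRegression combination and assumes a hypothetical trainingDF DataFrame with text and label columns:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Transformation stages: split raw text into words, then hash the words
// into term-frequency feature vectors.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

// Estimator stage: the model to be trained on the extracted features.
val lr = new LogisticRegression().setMaxIter(10)

// Chain all stages into a single Pipeline; fit() runs each stage in order
// and returns a PipelineModel that can be applied to new data.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDF) // trainingDF: assumed "text"/"label" DataFrame
```

Because the stages are encapsulated in one Pipeline, refining a feature transformer or swapping lr for a different estimator reruns the same sequence without touching the surrounding code.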
A pipeline of steps can be...