A pipelined analysis is a series of steps stored as a single function or object. Beyond providing a framework for your analysis, the most important reason for pipelining becomes apparent when you consider what is required to reproduce your workflow or apply it to new data. Now that you've seen a collection of data mining methods, it's a good time to acknowledge some facts:
- Most analysis workflows have multiple steps (cleaning, scaling, transforming, clustering, and so on)
- To reproduce the workflow, every step must be performed in exactly the right order
- Failure to reproduce the steps exactly can yield misleading results, and these failures are often silent
- Humans make mistakes, so we need to guard against those mistakes
The best guard against those mistakes is to build a pipeline, test it locally, and deploy the entire pipeline...
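As a concrete illustration, here is a minimal sketch using scikit-learn's `Pipeline`. The specific steps (scaling, dimensionality reduction, clustering) and their parameters are illustrative assumptions, not a prescription; the point is that the whole sequence lives in one object.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data for demonstration; in practice this would be your own dataset
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

# Each step always runs in this fixed order, so the workflow
# cannot accidentally be reproduced out of sequence
pipeline = Pipeline([
    ("scale", StandardScaler()),                          # cleaning/scaling
    ("reduce", PCA(n_components=2)),                      # transforming
    ("cluster", KMeans(n_clusters=3, random_state=42)),   # clustering
])

# One call executes every step, in order, on the data
labels = pipeline.fit_predict(X)

# The fitted pipeline is a single object, so new data passes
# through the identical sequence of steps automatically
new_labels = pipeline.predict(X[:5])
```

Because the fitted pipeline is one object, it can be serialized (for example with `joblib.dump`) and later reloaded and applied to new data without re-specifying any of the steps by hand.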