We have so far looked at data analysis life cycle tasks in isolation. In the real world, these tasks need to be connected to form a cohesive solution. Data pipelines are about chaining these tasks into end-to-end, data-oriented solutions.
Spark supports ML pipelines (https://spark.apache.org/docs/2.3.0/ml-pipeline.html), which chain feature transformers and estimators into a single workflow. We will look at Spark and its ML pipeline functionality in subsequent chapters.
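To give a flavor of what this looks like, here is a minimal sketch, closely following the pattern in the Spark documentation linked above, that chains a tokenizer, a feature hasher, and a logistic regression estimator into one pipeline. The toy data and parameter values are illustrative assumptions, not part of the original text:

```python
# A minimal Spark ML pipeline sketch: tokenize text, hash tokens into
# feature vectors, and fit a logistic regression model in one workflow.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Toy training data: (id, text, label) -- illustrative only
training = spark.createDataFrame(
    [(0, "spark is great", 1.0), (1, "hadoop map reduce", 0.0)],
    ["id", "text", "label"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)

# The Pipeline runs the stages in order; fit() returns a PipelineModel
# that can be applied to new data with transform().
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(training)
```

The key idea is that the fitted PipelineModel packages every stage together, so the same preprocessing is applied consistently at training and prediction time.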
Jupyter Notebooks (http://jupyter.org/) are another great option for creating an integrated data pipeline. Papermill (https://github.com/nteract/papermill) is an open source project that helps parameterize and run Jupyter Notebooks. We will explore some of these options later in the book.
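As a rough sketch of how Papermill fits into a pipeline, the snippet below runs a template notebook with injected parameters using Papermill's Python API. The notebook names and the parameters shown are hypothetical, and the input notebook would need a cell tagged "parameters" for the injection to work:

```python
# A minimal Papermill sketch: execute a parameterized notebook and save
# the executed copy, with the given parameters injected, as a new file.
import papermill as pm

pm.execute_notebook(
    "analysis_template.ipynb",   # hypothetical input notebook
    "analysis_output.ipynb",     # executed copy with injected parameters
    parameters={"alpha": 0.6, "data_path": "data/sales.csv"},
)
```

Because each run produces a fully executed notebook, the output file doubles as a record of that pipeline run, which is useful for auditing and debugging.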