Until now, we have been writing code as separate notebooks and scripts. In the previous chapter, we learned how to group those scripts into a package so that they can be distributed and tested properly. In many cases, however, we need to execute certain tasks on a strict schedule. Often, we need to process data on a regular basis—run analytics, collect information from external sources, or re-train an ML model. All of this is prone to errors: tasks may depend on other tasks, and some tasks shouldn't run before others. Tasks should therefore be easy to orchestrate, monitor, and re-run.
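As a minimal illustration of the dependency problem described above (not a full orchestrator), Python's standard-library `graphlib` can compute a valid execution order for tasks that depend on one another; the task names here are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the tasks it depends on.
dependencies = {
    "collect_data": [],
    "process_data": ["collect_data"],
    "run_analytics": ["process_data"],
    "retrain_model": ["process_data"],
}

# static_order() yields tasks so that every task comes after its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Real orchestration tools add scheduling, monitoring, and retries on top of this basic idea of a dependency graph.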
In this chapter, we will learn to build and orchestrate our own data pipelines. Building good pipelines is an important skill that can save a great deal of time and stress for anyone who masters it.
In particular, we will cover the following topics...