Data pipelines are ubiquitous and essential. Even organizations with only a small online presence run their own jobs: thousands of research facilities, meteorological centers, observatories, hospitals, military bases, and banks all process data internally.
Another name for data pipelines is ETL, which stands for Extract, Transform, and Load: the three conceptual stages of every pipeline. At first glance, the task may sound trivial. Most of our notebooks are, in a way, ETL jobs: we load some data, work with it, and then store it somewhere. However, building and maintaining a good pipeline requires a thorough and consistent approach. Processes should be reliable, easy to re-run, and reusable. Individual tasks shouldn't run more than once, and they shouldn't run at all if their dependencies are not satisfied (say, other tasks haven't finished yet).
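To make the three stages concrete, here is a minimal sketch of an ETL job in plain Python with pandas. The file names, table name, and columns ("sales.csv", "warehouse.db", "daily_sales", "date", "amount") are hypothetical, chosen only for illustration:

```python
import sqlite3

import pandas as pd


def extract(path: str) -> pd.DataFrame:
    # Extract: read the raw data from its source (here, a CSV file).
    return pd.read_csv(path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean and reshape, e.g. drop incomplete rows
    # and aggregate amounts per day.
    df = df.dropna(subset=["amount"])
    return df.groupby("date", as_index=False)["amount"].sum()


def load(df: pd.DataFrame, db_path: str) -> None:
    # Load: write the result to its destination (here, a SQLite table).
    with sqlite3.connect(db_path) as conn:
        df.to_sql("daily_sales", conn, if_exists="replace", index=False)


if __name__ == "__main__":
    load(transform(extract("sales.csv")), "warehouse.db")
```

A script like this covers the happy path, but it says nothing about re-running safely, skipping work that is already done, or waiting on upstream tasks; those are exactly the concerns that separate a throwaway notebook from a production pipeline.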
It...