Introduction
Pipelines are fundamental in any data science environment, because data processing is never a single task. Many pipelines are implemented via ad hoc scripts. This can be done in a useful way, but in many cases such scripts fail on several fundamental criteria: reproducibility, maintainability, and extensibility.
In bioinformatics, you can find three main types of pipeline systems:
- Frameworks like Galaxy (https://usegalaxy.org), which are geared toward users, that is, they expose easy-to-use user interfaces, hiding most of the underlying machinery
- Frameworks like Script of Scripts (SoS) (https://vatlab.github.io/sos-docs/), which are geared toward data analysis, with a focus on users with programming knowledge
- Finally, generic workflow systems like Apache Airflow (https://airflow.incubator.apache.org/), which take a less data-centered approach to workflow management
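To make the common idea behind all of these systems concrete, the sketch below shows the core of what a workflow engine does: run tasks in an order that respects their dependencies. This is a toy illustration in plain Python (using the standard library's `graphlib`), not the API of Galaxy, SoS, or Airflow; the three-step analysis it runs is hypothetical.

```python
# Toy sketch of a workflow engine's core job: execute tasks in
# dependency order. Not Galaxy, SoS, or Airflow - just the concept.
from graphlib import TopologicalSorter


def run_pipeline(tasks, deps):
    """Run each task once, after all of its dependencies.

    tasks: mapping of task name -> zero-argument callable
    deps:  mapping of task name -> set of prerequisite task names
    """
    # static_order() yields names so that prerequisites come first
    order = list(TopologicalSorter(deps).static_order())
    results = {name: tasks[name]() for name in order}
    return order, results


# Hypothetical three-step analysis: download -> clean -> summarize
log = []
tasks = {
    "download": lambda: log.append("download") or "raw data",
    "clean": lambda: log.append("clean") or "clean data",
    "summarize": lambda: log.append("summarize") or "report",
}
deps = {"clean": {"download"}, "summarize": {"clean"}}

order, results = run_pipeline(tasks, deps)
```

Real workflow systems add what this sketch lacks: caching of completed steps, re-running only what changed, logging, and distribution across machines, which is precisely where ad hoc scripts tend to fall short.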
In this chapter, we will discuss Galaxy, which is especially important for bioinformaticians who support users that are less inclined to...