Getting started with Airflow
In this first section, we will get Apache Airflow up and running on our local machine using the Astro CLI. Astro makes it easy to install and manage Apache Airflow. We will also take a deep dive into the components that make up Airflow’s architecture.
Installing Airflow with Astro
Astro is a command-line interface (CLI) provided by Astronomer that allows you to quickly install and run Apache Airflow. With it, we can spin up a local Airflow environment in minutes, as it abstracts away the complexity of manually installing and configuring each Airflow component.
Installing the Astro CLI is straightforward; you can find the installation instructions at https://docs.astronomer.io/astro/cli/install-cli. Once it is installed, the first step is to initialize a new Airflow project. In the terminal, run the following command:
astro dev init
This will create a folder structure for an Airflow project locally.
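At the time of writing, the generated project looks roughly like this (the exact files may vary slightly between Astro CLI versions):

dags/                  # Python files defining your DAGs
include/               # Any additional files your DAGs need
plugins/               # Custom or community Airflow plugins
tests/                 # Tests for your DAGs
Dockerfile             # Builds the image used to run Airflow locally
requirements.txt       # Python dependencies to install in the image
packages.txt           # OS-level packages to install in the image
.env                   # Environment variables, including Airflow configuration overrides
airflow_settings.yaml  # Local-only Airflow connections, variables, and pools

Next, start up Airflow: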
astro dev start
This will pull the necessary Docker images and start containers for Airflow's local components, including the web server, scheduler, and a PostgreSQL metadata database.
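If you want to confirm that everything came up correctly, the Astro CLI provides commands to inspect the local environment (the exact output depends on your CLI version):

astro dev ps      # list the containers of the local Airflow environment
astro dev logs    # show the logs of the local Airflow components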
You can access the Airflow UI at http://localhost:8080. The default username and password are both admin.
That’s it! In just a few commands, we have a fully functioning Airflow environment up and running locally. Now let’s take a deeper look into Airflow’s architecture.
Airflow architecture
Airflow is composed of different components that fit together to provide a scalable and reliable orchestration platform for data pipelines.
At a high level, Airflow has the following:
- A metadata database that stores state for DAGs, task instances, XComs, and so on
- A web server that serves the Airflow UI
- A scheduler that handles triggering DAGs and task instances
- Executors that run task instances
- Workers that execute tasks
- Other components, such as the CLI
This architecture is depicted here:
Figure 6.1 – Airflow Architecture
Airflow relies heavily on the metadata database as the source of truth for state. The web server, scheduler, and worker processes all talk to this database. When you look at the Airflow UI, it is simply querying this database underneath to fetch the information it displays.
The metadata database is also used to enforce certain constraints. For example, the scheduler uses database locks when examining task instances to determine what to schedule next. This prevents race conditions between multiple scheduler processes.
Important note
A race condition occurs when two or more threads or processes access a shared resource concurrently, and the final output depends on the sequence or timing of the execution. The threads “race” to access or modify the shared resource, and the final state depends, unpredictably, on who gets there first. Race conditions are a common source of bugs and unpredictable behavior in concurrent systems. They can result in corrupted data, crashes, or incorrect outputs.
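To make this concrete, here is a minimal Python sketch (plain threading code, not Airflow-specific) of a race condition: two threads increment a shared counter without any locking, so updates can be lost depending on how their execution interleaves:

import threading

counter = 0  # the shared resource

def increment(times):
    global counter
    for _ in range(times):
        # This read-modify-write is not atomic: the other thread can run
        # between the read and the write, so one of the updates is lost.
        current = counter
        counter = current + 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# We would expect 200000, but depending on thread scheduling,
# the printed value can be lower and varies from run to run.
print(f"counter = {counter}")

Airflow avoids the analogous problem between multiple schedulers by taking the database locks mentioned previously, so that only one scheduler can claim a given task instance at a time.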
Now let’s examine some of the key components in more detail.
Web server
The Airflow web server is responsible for hosting the Airflow UI you interact with, providing REST APIs for other services to communicate with Airflow, and serving static assets and pages. The Airflow UI allows you to monitor, trigger, and troubleshoot DAGs and tasks. It provides visibility into the overall health of your data pipelines.
The web server also exposes a REST API that the CLI, external services, and custom applications can use to interact with Airflow programmatically, for example, to trigger a DAG run or to query the state of DAGs and task instances.
While the UI is very convenient for humans, other tools and services rely on the underlying REST API. Overall, the Airflow web server is critical, as it provides a central way for users and services to interact with Airflow and its metadata.
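As an illustration, here is a minimal Python sketch that lists DAGs through Airflow 2's stable REST API in the local environment we started earlier. It assumes that the basic-auth API backend is enabled (as it is in a default Astro project), that the requests library is available, and that the admin/admin credentials from before are still in place:

import requests

# Base URL of the local Airflow web server started by astro dev start
BASE_URL = "http://localhost:8080/api/v1"

# List the DAGs that Airflow currently knows about
response = requests.get(f"{BASE_URL}/dags", auth=("admin", "admin"))
response.raise_for_status()

for dag in response.json()["dags"]:
    print(dag["dag_id"], "paused:", dag["is_paused"])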
Scheduler
The Airflow scheduler is the brain of the operation: it examines task instances and determines what to run next. Its key responsibilities include the following:
- Checking the status of task instances in the metadata database
- Examining dependencies between tasks to create a DAG run execution plan
- Setting tasks to scheduled or queued in the database
- Tracking the progress as task instances move through different states
- Handling the backfilling of historical runs
To perform these duties, the scheduler does the following:
- Re-parses the DAG files to refresh its view of all active DAGs
- Examines active DAG runs to see what tasks need to be scheduled
- Checks on the status of running tasks via their recorded heartbeats
- Updates the state of tasks in the database – queued, running, success, failed, and so on
The metadata database is critical to the scheduler's functioning. It also allows the scheduler to be highly scalable, since multiple scheduler instances can coordinate and stay in sync via this single source of truth.
The scheduler is very versatile – you can run a single scheduler for small workloads or scale up to multiple active schedulers for large workloads.
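For example, in a self-managed deployment, adding scheduler capacity is, at a high level, a matter of starting another scheduler process on a different machine that points at the same metadata database. The sketch below assumes a database that supports the row-level locking described earlier, such as PostgreSQL:

# On each additional scheduler machine, configured to use
# the same metadata database as the existing scheduler:
airflow scheduler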
Executors
When a task needs to run, the executor is responsible for actually running the task. Executors interface with a pool of workers that execute tasks.
The most common executors are LocalExecutor, CeleryExecutor, and KubernetesExecutor:
- LocalExecutor runs task instances in parallel processes on the host system. It is great for testing but has very limited scalability for large workloads.
- CeleryExecutor uses a Celery pool to distribute tasks. It allows running workers across multiple machines and, thus, provides horizontal scalability.
- KubernetesExecutor is specially designed for Airflow deployments running in Kubernetes. It launches worker Pods in Kubernetes dynamically. It provides excellent scalability and resource isolation.
As we move our Airflow deployment to production, being able to scale out workers is critical, so KubernetesExecutor will play a central role in our case.
For testing locally, LocalExecutor is the simplest. Astro configures this by default.
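The executor is selected through Airflow's configuration. In an Astro project, one way to do this is with an environment variable in the project's .env file, as sketched here; note that the Celery and Kubernetes options require additional configuration (a broker, Kubernetes access, and so on) beyond this single line:

# .env – pick exactly one executor
AIRFLOW__CORE__EXECUTOR=LocalExecutor
# AIRFLOW__CORE__EXECUTOR=CeleryExecutor
# AIRFLOW__CORE__EXECUTOR=KubernetesExecutor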
Workers
Workers execute the actual logic for task instances. The executor manages and interfaces with the worker pool. Workers carry out tasks such as the following:
- Running Python functions
- Executing Bash commands
- Making API requests
- Performing data transfers
- Communicating task status
Depending on the executor, workers may run as local subprocesses, as processes on separate machines, or in separate containers. The worker communicates the status of task instances to the metadata database, updating their state to queued, running, success, failed, and so on. This allows the scheduler to monitor progress and coordinate pipeline execution across workers.
In summary, workers provide the compute resources necessary to run our pipeline tasks. The executor interfaces with and manages these workers.
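As a small preview of the task logic workers run (we will start building DAGs properly at the end of this section), here is a minimal sketch of a DAG with one Bash task and one Python task. Whichever executor is configured, it is ultimately a worker that executes these two commands. The sketch assumes a recent Airflow 2 release:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def say_hello():
    # This function body is what a worker actually executes
    print("Hello from a worker!")

with DAG(
    dag_id="worker_demo",
    start_date=datetime(2023, 1, 1),
    schedule=None,  # only run when triggered manually
    catchup=False,
) as dag:
    bash_task = BashOperator(task_id="run_bash", bash_command="echo 'Running on a worker'")
    python_task = PythonOperator(task_id="run_python", python_callable=say_hello)

    bash_task >> python_task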
Queueing
For certain executors, such as Celery and Kubernetes, you need an additional queueing service. This queue stores tasks before workers pick them up. There are a few common queueing technologies that can be used with Celery, such as RabbitMQ (a popular open source queue), Redis (an in-memory datastore), and Amazon SQS (a fully managed queue service by AWS).
For Kubernetes, we don't need any of these tools, as KubernetesExecutor dynamically launches a Pod for each task and terminates it when the task completes.
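To give a sense of what this looks like in practice, a Celery-based setup might point Airflow at a Redis broker with configuration along the following lines. This is only a sketch with illustrative hostnames and credentials, not a complete CeleryExecutor deployment:

AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres:5432/airflow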
Metadata database
As highlighted earlier, Airflow relies on its metadata database heavily. This database stores the state and metadata for Airflow to function. The default for local testing is SQLite, which is simple but has major scalability limitations. Even for moderate workloads, it is recommended to switch to a more production-grade database.
Airflow works with PostgreSQL, MySQL, and a variety of cloud-based database services, such as Amazon RDS.
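The database connection is configured through Airflow's SQL Alchemy connection string. Pointing Airflow at a PostgreSQL instance might look like the following sketch, with placeholder credentials and hostname; on Airflow versions before 2.3, the same setting lives under the core section rather than database:

AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow_user:airflow_pass@my-postgres-host:5432/airflow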
Airflow’s distributed architecture
As we can see, Airflow is built on a modular, distributed architecture. This design brings several advantages for production workloads:
- Separation of concerns: Each component focuses on a specific job. The scheduler handles examining DAGs and scheduling. The worker runs task instances. This separation of concerns keeps components simple and maintainable.
- Scalability: Components such as the scheduler, worker, and database can be easily scaled out. Run multiple schedulers or workers as your workload grows. Leverage a hosted database for automatic scaling.
- Reliability: If one scheduler or worker dies, there is no overall outage since components are decoupled. The single source of truth in the database also provides consistency across Airflow.
- Extensibility: You can swap out certain components, such as the executor or queueing service.
In summary, Airflow provides scalability, reliability, and flexibility via its modular architecture. Each component has a focused job, leading to simplicity and stability in the overall system.
Now, let’s get back to Airflow and start building some simple DAGs.