Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Big Data on Kubernetes

You're reading from   Big Data on Kubernetes A practical guide to building efficient and scalable data solutions

Arrow left icon
Product type Paperback
Published in Jul 2024
Publisher Packt
ISBN-13 9781835462140
Length 296 pages
Edition 1st Edition
Languages
Concepts
Arrow right icon
Author (1):
Arrow left icon
Neylson Crepalde Neylson Crepalde
Author Profile Icon Neylson Crepalde
Neylson Crepalde
Arrow right icon
View More author details
Toc

Table of Contents (18) Chapters Close

Preface 1. Part 1:Docker and Kubernetes
2. Chapter 1: Getting Started with Containers FREE CHAPTER 3. Chapter 2: Kubernetes Architecture 4. Chapter 3: Getting Hands-On with Kubernetes 5. Part 2: Big Data Stack
6. Chapter 4: The Modern Data Stack 7. Chapter 5: Big Data Processing with Apache Spark 8. Chapter 6: Building Pipelines with Apache Airflow 9. Chapter 7: Apache Kafka for Real-Time Events and Data Ingestion 10. Part 3: Connecting It All Together
11. Chapter 8: Deploying the Big Data Stack on Kubernetes 12. Chapter 9: Data Consumption Layer 13. Chapter 10: Building a Big Data Pipeline on Kubernetes 14. Chapter 11: Generative AI on Kubernetes 15. Chapter 12: Where to Go from Here 16. Index 17. Other Books You May Enjoy

Getting started with Airflow

In this first section, we will get Apache Airflow up and running on our local machine using the Astro CLI. Astro makes it easy to install and manage Apache Airflow. We will also take a deep dive into the components that make up Airflow’s architecture.

Installing Airflow with Astro

Astro is a command-line interface provided by Astronomer that allows you to quickly install and run Apache Airflow. With Astro, we can quickly spin up a local Airflow environment. It abstracts away the complexity of manually installing all Airflow components.

Installing the Astro CLI is very straightforward. You can find instructions for its installation here: https://docs.astronomer.io/astro/cli/install-cli. Once installed, the first thing to do is to initiate a new Airflow project. In the terminal, run the following command:

astro dev init

This will create a folder structure for an Airflow project locally. Next, start up Airflow:

astro dev start

This will pull the necessary Docker images and start containers for the Airflow web server, scheduler, worker, and PostgreSQL database.

You can access the Airflow UI at http://localhost:8080. The default username and password are admin.

That’s it! In just a few commands, we have a fully functioning Airflow environment up and running locally. Now let’s take a deeper look into Airflow’s architecture.

Airflow architecture

Airflow is composed of different components that fit together to provide a scalable and reliable orchestration platform for data pipelines.

At a high level, Airflow has the following:

  • A metadata database that stores state for DAGs, task instances, XComs, and so on
  • A web server that serves the Airflow UI
  • A scheduler that handles triggering DAGs and task instances
  • Executors that run task instances
  • Workers that execute tasks
  • Other components, such as the CLI

This architecture is depicted here:

Figure 6.1 – Airflow Architecture

Figure 6.1 – Airflow Architecture

Airflow relies heavily on the metadata database as the source of truth for state. The web server, scheduler, and worker processes talk to this database. When you look at the Airflow UI, underneath, it simply queries this database to get info to display.

The metadata database is also used to enforce certain constraints. For example, the scheduler uses database locks when examining task instances to determine what to schedule next. This prevents race conditions between multiple scheduler processes.

Important note

A race condition occurs when two or more threads or processes access a shared resource concurrently, and the final output depends on the sequence or timing of the execution. The threads “race” to access or modify the shared resource, and the final state depends, unpredictably, on who gets there first. Race conditions are a common source of bugs and unpredictable behavior in concurrent systems. They can result in corrupted data, crashes, or incorrect outputs.

Now let’s examine some of the key components in more detail.

Web server

The Airflow web server is responsible for hosting the Airflow UI you interact with, providing REST APIs for other services to communicate with Airflow, and serving static assets and pages. The Airflow UI allows you to monitor, trigger, and troubleshoot DAGs and tasks. It provides visibility into the overall health of your data pipelines.

The web server also exposes REST APIs that are used by the CLI, scheduler, workers, and custom applications to talk to Airflow. For example, the CLI uses the API to trigger DAGs. The scheduler uses it to update state for DAGs. Workers use it to update task instance state as they process them.

While the UI is very convenient for humans, services rely on the underlying REST APIs. Overall, the Airflow web server is critical as it provides a central way for users and services to interact with Airflow metadata.

Scheduler

The Airflow scheduler is the brains behind examining task instances and determining what to run next. Its key responsibilities include the following:

  • Checking the status of task instances in the metadata database
  • Examining dependencies between tasks to create a DAG run execution plan
  • Setting tasks to scheduled or queued in the database
  • Tracking the progress as task instances move through different states
  • Handling the backfilling of historical runs

To perform these duties, the scheduler does the following:

  1. Refreshes the DAG dictionary with details about all active DAGs
  2. Examines active DAG runs to see what tasks need to be scheduled
  3. Checks on the status of running tasks via the job tracker
  4. Updates the state of tasks in the database – queued, running, success, failed, and so on

Critical to the scheduler’s functioning is the metadata database. This allows it to be highly scalable since multiple schedulers can coordinate and sync via the single source of truth in the database.

The scheduler is very versatile – you can run a single scheduler for small workloads or scale up to multiple active schedulers for large workloads.

Executors

When a task needs to run, the executor is responsible for actually running the task. Executors interface with a pool of workers that execute tasks.

The most common executors are LocalExecutor, CeleryExecutor, and KubernetesExecutor:

  • LocalExecutor runs task instances in parallel processes on the host system. It is great for testing but has very limited scalability for large workloads.
  • CeleryExecutor uses a Celery pool to distribute tasks. It allows running workers across multiple machines and, thus, provides horizontal scalability.
  • KubernetesExecutor is specially designed for Airflow deployments running in Kubernetes. It launches worker Pods in Kubernetes dynamically. It provides excellent scalability and resource isolation.

As we move our Airflow to production, being able to scale out workers is critical. KubernetesExecutor will play a main role in our case.

For testing locally, LocalExecutor is the simplest. Astro configures this by default.

Workers

Workers execute the actual logic for task instances. The executor manages and interfaces with the worker pool. Workers carry out tasks such as the following:

  • Running Python functions
  • Executing Bash commands
  • Making API requests
  • Performing data transfers
  • Communicating task status

Based on the executor, workers may run on threads, server processes, or in separate containers. The worker communicates the status of task instances to the metadata database. It updates state to queued, running, success, failed, and so on. This allows the scheduler to monitor progress and coordinate pipeline execution across workers.

In summary, workers provide the compute resources necessary to run our pipeline tasks. The executor interfaces with and manages these workers.

Queueing

For certain executors, such as Celery and Kubernetes, you need an additional queueing service. This queue stores tasks before workers pick them up. There are a few common queueing technologies that can be used with Celery, such as RabbitMQ (a popular open source queue), Redis (an in-memory datastore), and Amazon SQS (a fully managed queue service by AWS).

For Kubernetes, we don’t need any of these tools as KubernetesExecutor dynamically launches Pods to execute tasks and kill them when the tasks are done.

Metadata database

As highlighted earlier, Airflow relies on its metadata database heavily. This database stores the state and metadata for Airflow to function. The default for local testing is SQLite, which is simple but has major scalability limitations. Even for moderate workloads, it is recommended to switch to a more production-grade database.

Airflow works with PostgreSQL, MySQL, and a variety of cloud-based database services, such as Amazon RDS.

Airflow’s distributed architecture

As we can see, Airflow works with a modular distributed architecture. This design brings several advantages for production workloads:

  • Separation of concerns: Each component focuses on a specific job. The scheduler handles examining DAGs and scheduling. The worker runs task instances. This separation of concerns keeps components simple and maintainable.
  • Scalability: Components such as the scheduler, worker, and database can be easily scaled out. Run multiple schedulers or workers as your workload grows. Leverage a hosted database for automatic scaling.
  • Reliability: If one scheduler or worker dies, there is no overall outage since components are decoupled. The single source of truth in the database also provides consistency across Airflow.
  • Extensibility: You can swap out certain components, such as the executor or queueing service.

In summary, Airflow provides scalability, reliability, and flexibility via its modular architecture. Each component has a focused job, leading to simplicity and stability in the overall system.

Now, let’s get back to Airflow and start building some simple DAGs.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image