Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA
In the previous few chapters, we explained how you can use Amazon EMR clusters for on-demand ETL jobs or as long-running clusters that either run a real-time streaming application or serve as a backend for interactive development using notebooks. But when we build a data pipeline to automate data ingestion, cleansing, or transformation, we need orchestration tools that let us build workflows triggered either on a schedule or by an event.
Two orchestration tools, AWS Step Functions and Apache Airflow, are widely used to build data pipelines with Amazon EMR. AWS also provides a managed Airflow offering called Amazon Managed Workflows for Apache Airflow (MWAA).
In this chapter, we will provide an overview of AWS Step Functions and MWAA services and then explain how you can leverage them to orchestrate a data pipeline that...