Deploying the Big Data Stack on Kubernetes
In this chapter, we will cover the deployment of key big data technologies – Spark, Airflow, and Kafka – on Kubernetes. As container orchestration and management have become critical for running data workloads efficiently, Kubernetes has emerged as the de facto standard. By the end of this chapter, you will be able to successfully deploy and manage big data stacks on Kubernetes for building robust data pipelines and applications.
We will start by deploying Apache Spark on Kubernetes using the Spark operator. You will learn how to configure and monitor Spark jobs that run as SparkApplication custom resources on your Kubernetes cluster. Running Spark workloads on Kubernetes brings important benefits such as dynamic scaling, dependency versioning through container images, and unified resource management.
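To make this concrete, here is a minimal sketch of submitting a SparkApplication through the official Kubernetes Python client. It assumes the Spark operator (API group sparkoperator.k8s.io/v1beta2) is already installed and that a spark service account with the required RBAC exists; the namespace, image tag, and example jar path are placeholders you would adapt to your own cluster:

```python
# A minimal sketch: submitting the operator's SparkPi example as a
# SparkApplication custom resource via the official Kubernetes Python client.
# Assumes the Spark operator is installed; the namespace, image tag, jar path,
# and "spark" service account below are placeholders for your environment.
from kubernetes import client, config

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "spark-pi", "namespace": "default"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "image": "spark:3.5.0",  # assumed image tag
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar",
        "sparkVersion": "3.5.0",
        "driver": {"cores": 1, "memory": "512m", "serviceAccount": "spark"},
        "executor": {"cores": 1, "instances": 2, "memory": "512m"},
    },
}

config.load_kube_config()  # use the current kubeconfig context
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="default",
    plural="sparkapplications",
    body=spark_app,
)
```

Once the resource is created, the operator launches the driver and executor pods on your behalf; running kubectl get sparkapplication spark-pi and kubectl logs on the driver pod are the quickest ways to check progress.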
Next, we will deploy Apache Airflow on Kubernetes. You will configure Airflow on Kubernetes, ship its task logs to S3 for easier debugging and monitoring, and set it up...