Summary
In this chapter, we brought together all the knowledge and skills acquired throughout the book to build two complete data pipelines on Kubernetes: a batch processing pipeline and a real-time pipeline. We started by ensuring that all the necessary tools, such as a Spark operator, a Strimzi operator, Airflow, and Trino, were correctly deployed and running in our Kubernetes cluster.
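As a quick recap of that verification step, the readiness of the deployments backing these tools can also be checked programmatically. The sketch below uses the official Kubernetes Python client; the namespace and deployment names are hypothetical placeholders and will depend on how the operators and charts were installed in your cluster.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config()
# instead when running inside the cluster).
config.load_kube_config()
apps = client.AppsV1Api()

# Hypothetical namespace/deployment pairs; adjust to match your installation.
deployments = [
    ("spark-operator", "spark-operator"),
    ("kafka", "strimzi-cluster-operator"),
    ("airflow", "airflow-scheduler"),
    ("trino", "trino-coordinator"),
]

for namespace, name in deployments:
    dep = apps.read_namespaced_deployment(name=name, namespace=namespace)
    ready = dep.status.ready_replicas or 0
    print(f"{namespace}/{name}: {ready}/{dep.spec.replicas} replicas ready")
```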
For the batch pipeline, we orchestrated the entire process, from data acquisition and ingestion into a data lake on Amazon S3, through data processing with Spark, to delivering consumption-ready tables in Trino. We learned how to create Airflow DAGs, configure Spark applications, and integrate the different tools seamlessly into a complex, end-to-end data pipeline.
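As a reminder of the overall shape of that orchestration, here is a minimal sketch of such a DAG, assuming Airflow 2.4+ with the cncf.kubernetes and common.sql provider packages installed; the task IDs, the SparkApplication manifest path, the Trino connection ID, and the schema and table names are hypothetical placeholders, not the exact objects built in the chapter.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator


def ingest_to_s3(**context):
    """Placeholder: acquire the raw data and land it in the S3 data lake."""
    ...


with DAG(
    dag_id="batch_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_to_s3", python_callable=ingest_to_s3)

    # Submits a SparkApplication custom resource for the Spark operator to run.
    process = SparkKubernetesOperator(
        task_id="process_with_spark",
        namespace="spark",
        application_file="manifests/batch_job.yaml",  # hypothetical manifest path
    )

    # Makes the processed data queryable as a consumption-ready table in Trino
    # (here by syncing partition metadata in a Hive-backed catalog).
    publish = SQLExecuteQueryOperator(
        task_id="publish_to_trino",
        conn_id="trino_default",
        sql="CALL hive.system.sync_partition_metadata('analytics', 'orders', 'ADD')",
    )

    ingest >> process >> publish
```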
In the real-time pipeline, we tackled the challenges of processing and analyzing data streams in real time. We set up a Postgres database as our data source, deployed Kafka Connect and Elasticsearch, and built a Spark Streaming application to process the stream of change events and make the results available for real-time analytics.
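A minimal sketch of such a streaming job is shown below, assuming PySpark Structured Streaming with the Kafka source and the Elasticsearch-Hadoop (es-hadoop) connector on the Spark classpath; the bootstrap server, topic, event schema, index name, and checkpoint location are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("realtime-pipeline").getOrCreate()

# Illustrative schema for the change events published by Kafka Connect.
schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
])

# Read the change events from Kafka and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "my-cluster-kafka-bootstrap:9092")
    .option("subscribe", "postgres.public.orders")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Continuously write the parsed events to Elasticsearch via es-hadoop.
query = (
    events.writeStream.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "elasticsearch-master")
    .option("es.resource", "orders")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/orders")
    .start()
)

query.awaitTermination()
```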