Building a Big Data Pipeline on Kubernetes
In the previous chapters, we covered the individual components required for building big data pipelines on Kubernetes, exploring tools such as Kafka, Spark, Airflow, and Trino. In the real world, however, these tools don’t operate in isolation: they must be integrated and orchestrated to form complete data pipelines that can handle a variety of data processing requirements.
In this chapter, we will bring together the knowledge and skills you have acquired so far and put them into practice by building two complete data pipelines: a batch processing pipeline and a real-time pipeline. By the end of this chapter, you will be able to do the following:
- Deploy and orchestrate all the necessary tools for building big data pipelines on Kubernetes
- Write code for data processing, orchestration, and querying using Python, SQL, and APIs
- Integrate different tools seamlessly to create complex data pipelines
- Understand and apply best practices when building big data pipelines on Kubernetes