What this book covers
Chapter 1, Getting Started with Containers, takes you on a journey through containers and Docker, the foundational technologies for modern application deployment. You’ll learn how to install Docker and run your first container image, experiencing the power of containerization firsthand. Additionally, you’ll dive into the intricacies of Dockerfiles, mastering the art of crafting concise and functional container images. Through practical examples, including the construction of a simple API and a data processing job with Python, you’ll grasp the nuances of containerizing services and jobs. By the end of this chapter, you’ll have the opportunity to solidify your newfound knowledge by building your own job and API, laying the groundwork for a portfolio of practical container-based applications.
Chapter 2, Kubernetes Architecture, introduces you to the core components that make up the Kubernetes architecture. You will learn about the control plane components, such as the API server, etcd, the scheduler, and the controller manager, as well as the worker node components, such as the kubelet, kube-proxy, and the container runtime. The chapter will explain the roles and responsibilities of each component and how they interact with each other to ensure the smooth operation of a Kubernetes cluster. Additionally, you will gain an understanding of the key concepts in Kubernetes, including Pods, Deployments, Services, Jobs, StatefulSets, PersistentVolumes, ConfigMaps, and Secrets. By the end of this chapter, you will have a solid foundation in the architecture and core concepts of Kubernetes, preparing you for hands-on experience in the subsequent chapters.
Chapter 3, Kubernetes – Hands On, guides you through the process of deploying a local Kubernetes cluster using kind, and a cloud-based cluster on AWS using Amazon EKS. You will learn the minimal AWS account configuration required to successfully deploy an EKS cluster. After setting up the clusters, you will have the opportunity to choose between deploying your applications in the local or the cloud environment. Regardless of your choice, you will take the API and the data processing job developed in Chapter 1 and deploy them to Kubernetes. This hands-on experience will solidify your understanding of Kubernetes concepts and prepare you for more advanced topics in the following chapters.
Chapter 4, The Modern Data Stack, introduces you to the most well-known data architecture designs, with a focus on the “lambda” architecture. You will learn about the tools that make up the modern data stack, which is a set of technologies used to implement a data lake(house) architecture. Among these tools are Apache Spark for data processing, Apache Airflow for data pipeline orchestration, and Apache Kafka for real-time event streaming and data ingestion. This chapter will provide a conceptual introduction to these tools and how they work together to build the core technology assets of a data lake(house) architecture.
Chapter 5, Big Data Processing with Apache Spark, introduces you to Apache Spark, one of the most popular tools for big data processing. You will understand the core components of a Spark program, how it scales and handles distributed processing, and best practices for working with Spark. You will implement simple data processing tasks using both the DataFrames API and the Spark SQL API, leveraging Python to interact with Spark. The chapter will guide you through installing Spark locally for testing purposes, enabling you to gain hands-on experience with this powerful tool before deploying it on a larger scale.
Chapter 6, Apache Airflow for Building Pipelines, introduces you to Apache Airflow, a widely adopted open source tool for data pipeline orchestration. You will learn how to install Airflow using Docker and the Astro CLI, making the setup process straightforward. The chapter will familiarize you with Airflow’s core features and the most commonly used operators for data engineering tasks. Additionally, you will gain insights into best practices for building resilient and efficient data pipelines that leverage Airflow’s capabilities to the fullest. By the end of this chapter, you will have a solid understanding of how to orchestrate complex data workflows using Airflow, a crucial skill for any data engineer or data architect working with big data on Kubernetes.
Chapter 7, Apache Kafka for Real-Time Events and Data Ingestion, introduces you to Apache Kafka, a distributed event streaming platform that is widely used for building real-time data pipelines and streaming applications. You will understand Kafka’s architecture and how it scales while remaining resilient, enabling it to handle high volumes of real-time data with low latency. You will learn about Kafka’s distributed topic design, which underpins its robust performance for real-time events. The chapter will guide you through running Kafka locally with Docker and implementing basic read and write operations on topics. Additionally, you will explore different strategies for data replication and topic distribution, ensuring you can design and implement efficient and reliable Kafka clusters.
Chapter 8, Deploying the Big Data Stack on Kubernetes, guides you through the process of deploying the big data tools you learned about in the previous chapters on a Kubernetes cluster. You will start by building Bash scripts to deploy the Spark operator and run SparkApplications on Kubernetes. Next, you will deploy Apache Airflow to Kubernetes, enabling you to orchestrate data pipelines within the cluster. Additionally, you will deploy Apache Kafka on Kubernetes using both the ephemeral cluster and JBOD techniques. A Kafka Connect cluster will also be deployed, along with connectors to migrate data from SQL databases to persistent object storage. By the end of this chapter, you will have a fully functional big data stack running on Kubernetes, ready for further exploration and development.
Chapter 9, Data Consumption Layer, guides you through the process of securely making data available to business analysts in a big data architecture deployed on Kubernetes. You will start with an overview of a modern approach that uses a “data lake engine” instead of a traditional data warehouse. In this chapter, you will become familiar with Trino for consuming data directly from the data lake on Kubernetes. You will understand how a data lake engine works, deploy it on Kubernetes, and monitor query execution and history. Additionally, for real-time data, you will become familiar with Elasticsearch and Kibana for data consumption. You will deploy these tools, learn how to index data in them, and build a simple data visualization with Kibana.
Chapter 10, Building a Big Data Pipeline in Kubernetes, guides you through the process of deploying and orchestrating two complete data pipelines, one for batch processing and another for real-time processing, on a Kubernetes cluster. You will connect all the tools you’ve learned about throughout the book, such as Apache Spark, Apache Airflow, Apache Kafka, and Trino, to build a single, complex solution. You will deploy these tools on Kubernetes, write code for data processing and orchestration, and make the data available for querying through a SQL engine. By the end of this chapter, you will have hands-on experience in building and managing a comprehensive big data pipeline on Kubernetes, integrating various components and technologies into a cohesive and scalable architecture.
Chapter 11, Generative AI on Kubernetes, guides you through the process of deploying a generative AI application on Kubernetes using Amazon Bedrock as a service suite for foundation models. You will learn how to connect your application to a knowledge base serving as a Retrieval-Augmented Generation (RAG) layer, which enhances the AI model’s capabilities by providing access to external information sources. Additionally, you will discover how to automate task execution by AI models using agents, enabling seamless integration of generative AI into your workflows. By the end of this chapter, you will have a solid understanding of how to leverage the power of generative AI on Kubernetes, unlocking new possibilities for personalized customer experiences, intelligent assistants, and automated business analytics.
Chapter 12, Where to Go from Here, guides you through the next steps in your journey toward mastering big data and Kubernetes. You will explore crucial concepts and technologies that are essential for building robust and scalable solutions on Kubernetes. These include monitoring both Kubernetes and your applications, implementing a service mesh for efficient communication, securing your cluster and applications, enabling automated scalability, embracing GitOps and CI/CD practices for streamlined deployment and management, and controlling Kubernetes costs. For each topic, you’ll receive an overview and recommendations on the technologies to explore further, empowering you to deepen your knowledge and skills in these areas.