You're reading from Modern Data Architectures with Python A practical guide to building and deploying data pipelines, data warehouses, and data lakes with Python

Product type Paperback

Published in Sep 2023

Publisher Packt

ISBN-13 9781801070492

Length 318 pages

Edition 1st Edition

Languages

Python

Tools

MLflow

Concepts

Data Science

Author (1):

Brian Lipp

View More author details

Table of Contents (19) Chapters

Preface

1. Part 1:Fundamental Data Knowledge

2. Chapter 1: Modern Data Processing Architecture FREE CHAPTER

3. Chapter 2: Understanding Data Analytics

4. Part 2: Data Engineering Toolset

5. Chapter 3: Apache Spark Deep Dive

6. Chapter 4: Batch and Stream Data Processing Using PySpark

7. Chapter 5: Streaming Data with Kafka

8. Part 3:Modernizing the Data Platform

9. Chapter 6: MLOps

10. Chapter 7: Data and Information Visualization

11. Chapter 8: Integrating Continous Integration into Your Workflow

12. Chapter 9: Orchestrating Your Data Workflows

13. Part 4:Hands-on Project

14. Chapter 10: Data Governance

15. Chapter 11: Building out the Groundwork

16. Chapter 12: Completing Our Project

17. Index

Why subscribe?

18. Other Books You May Enjoy

Comparing the Lambda and Kappa architectures

In the beginning, we started with batch processing or scheduled data processing jobs. When we run data workloads in batches, we are setting a specific chronological cadence for those workloads to be triggered. For most workloads, this is perfectly fine, but there will always be a more significant time delay in our data. As technology has progressed, the ability to utilize real-time processing has become possible.

At the time of writing, there are two different directions architects are taking in dealing with these two workloads.

Lambda architecture

The following is the Lambda architecture, which has a combined batch and real-time consumption and serving layer:

Figure 1.3: Combined Lambda architecture

The following diagram shows a Lambda architecture with separate consumption and serving layers, one for batch and the other for real-time processing:

Figure 1.4: Separate combined Lambda architecture

The Lambda architecture was the first attempt to deal with both streaming and batch data. It grew out of systems that started with just traditional batch data. As a result, the Lambda architecture uses two separate layers for processing data. The first layer is the batch layer, which performs data transformations that are harder to accomplish in real-time processing workstreams. The second layer is a real-time layer, which is meant for processing data as soon as its ingested. As data is transformed in each layer, the data should have a unique ID that allows the data to be correlated, no matter the workstream.

Once the data products have been created from each layer, there can be a separate or combined consumption layer. A combined consumption layer is easier to create but given the technology, it can be challenging to accomplish complex models that span both types of data. In the consumption layer, batch and real-time processing can be combined, which will require matching IDs. The consumption layer is a landing zone for data products made in the batch or real-time layer. The storage mechanism for this layer can range from a data lake or a data lakehouse to a relational database. The serving layer is used to take data products in the consumption layer and create views, run AI, and access data through tools such as dashboards, apps, and notebooks.

The Lambda architecture is relatively easy to adopt, given that the technology and patterns are typically known and fit into a typical engineer’s comfort zone. Most deployments already have a batch layer, so often, the real-time layer is a bolt-on addition to the platform. What tends to happen over time is that the complexity grows in several ways. Two very complex systems must be maintained and coordinated. Also, two distinct sets of software must be developed and maintained. In most cases, the two layers do not have similar technology, which will translate into a variety of techniques and languages for writing software in and keeping it updated.

Kappa architecture

The following diagram shows the Kappa architecture. The essential details are that there is only one set of layers and batch data is extracted via real-time processing:

Figure 1.5: Kappa architecture

The Kappa architecture was designed due to frustrations with the Lambda architecture. With the Lambda architecture, we have two layers, one for batch processing and the other for stream processing. The Kappa architecture has only a single real-time layer, which it uses for all data. Now, if we take a step back, there will always be some amount of oddness because batch data isn’t real-time data. There is still a consumption layer that’s used to store data products and a serving layer for accessing those data products. Again, the caveat is that many batch-based workloads will need to be customized so that they only use streaming data. Kappa is often found in large tech companies that have a wealth of tech talent and the need for fast, real-time data access.

Where Lambda was relatively easy to adopt, Kappa is highly complex in comparison. Often, the minimal use case for a typical company for real-time data does not warrant such a difficult change. As expected, there are considerable benefits to the Kappa architecture. For one, maintenance is reduced significantly, and the code base is also slimmed down. Real-time data can be complex to work with at times. Think of a giant hose that can’t ever be turned off. Issues with data in the Kappa architecture can often be very challenging, given the nature of the data storage. In the batch processing layer, it’s easy to deploy a change to the data, but in the real-time processing layer, reprocessing the data is no trivial matter. What often happens is secondary storage is adopted for data so that data can be accessed in both places. A straightforward example of why having a copy of data in a secondary system is, for example, when in Kafka, you need to constantly adjust the data. We will discuss Kafka later, but I will just mention that having a way to dump a Kafka topic and quickly repopulate it can be a lifesaver.

You're reading from Modern Data Architectures with Python A practical guide to building and deploying data pipelines, data warehouses, and data lakes with Python

Table of Contents (19) Chapters

Comparing the Lambda and Kappa architectures

Lambda architecture

Kappa architecture

Authors (1)

Personalised recommendations for you