Comparing the Lambda and Kappa architectures
In the beginning, we started with batch processing or scheduled data processing jobs. When we run data workloads in batches, we are setting a specific chronological cadence for those workloads to be triggered. For most workloads, this is perfectly fine, but there will always be a more significant time delay in our data. As technology has progressed, the ability to utilize real-time processing has become possible.
At the time of writing, there are two different directions architects are taking in dealing with these two workloads.
Lambda architecture
The following is the Lambda architecture, which has a combined batch and real-time consumption and serving layer:
Figure 1.3: Combined Lambda architecture
The following diagram shows a Lambda architecture with separate consumption and serving layers, one for batch and the other for real-time processing:
Figure 1.4: Separate combined Lambda architecture
The Lambda architecture was the first attempt to deal with both streaming and batch data. It grew out of systems that started with just traditional batch data. As a result, the Lambda architecture uses two separate layers for processing data. The first layer is the batch layer, which performs data transformations that are harder to accomplish in real-time processing workstreams. The second layer is a real-time layer, which is meant for processing data as soon as its ingested. As data is transformed in each layer, the data should have a unique ID that allows the data to be correlated, no matter the workstream.
Once the data products have been created from each layer, there can be a separate or combined consumption layer. A combined consumption layer is easier to create but given the technology, it can be challenging to accomplish complex models that span both types of data. In the consumption layer, batch and real-time processing can be combined, which will require matching IDs. The consumption layer is a landing zone for data products made in the batch or real-time layer. The storage mechanism for this layer can range from a data lake or a data lakehouse to a relational database. The serving layer is used to take data products in the consumption layer and create views, run AI, and access data through tools such as dashboards, apps, and notebooks.
The Lambda architecture is relatively easy to adopt, given that the technology and patterns are typically known and fit into a typical engineer’s comfort zone. Most deployments already have a batch layer, so often, the real-time layer is a bolt-on addition to the platform. What tends to happen over time is that the complexity grows in several ways. Two very complex systems must be maintained and coordinated. Also, two distinct sets of software must be developed and maintained. In most cases, the two layers do not have similar technology, which will translate into a variety of techniques and languages for writing software in and keeping it updated.
Kappa architecture
The following diagram shows the Kappa architecture. The essential details are that there is only one set of layers and batch data is extracted via real-time processing:
Figure 1.5: Kappa architecture
The Kappa architecture was designed due to frustrations with the Lambda architecture. With the Lambda architecture, we have two layers, one for batch processing and the other for stream processing. The Kappa architecture has only a single real-time layer, which it uses for all data. Now, if we take a step back, there will always be some amount of oddness because batch data isn’t real-time data. There is still a consumption layer that’s used to store data products and a serving layer for accessing those data products. Again, the caveat is that many batch-based workloads will need to be customized so that they only use streaming data. Kappa is often found in large tech companies that have a wealth of tech talent and the need for fast, real-time data access.
Where Lambda was relatively easy to adopt, Kappa is highly complex in comparison. Often, the minimal use case for a typical company for real-time data does not warrant such a difficult change. As expected, there are considerable benefits to the Kappa architecture. For one, maintenance is reduced significantly, and the code base is also slimmed down. Real-time data can be complex to work with at times. Think of a giant hose that can’t ever be turned off. Issues with data in the Kappa architecture can often be very challenging, given the nature of the data storage. In the batch processing layer, it’s easy to deploy a change to the data, but in the real-time processing layer, reprocessing the data is no trivial matter. What often happens is secondary storage is adopted for data so that data can be accessed in both places. A straightforward example of why having a copy of data in a secondary system is, for example, when in Kafka, you need to constantly adjust the data. We will discuss Kafka later, but I will just mention that having a way to dump a Kafka topic and quickly repopulate it can be a lifesaver.