The modern data stack
Amazon Redshift was the first cloud-native data warehouse and provided a real step-change in capabilities. It had the ability to store almost limitless data at a low cost in a SQL-compatible database, and the massively parallel processing (MPP) capabilities meant you could process that data effectively and efficiently at scale.
This sounds like what we had with Hadoop, but the key differences were the SQL compatibility and the more strongly defined structure of the data. This made it much more accessible than the unstructured files on an HDFS cluster. It also presented an opportunity to build services on top of Redshift and later SQL-compatible warehouses such as Google BigQuery and Snowflake, which led to an explosion of tools that make up today’s modern data stack. This includes ELT tools such as Fivetran and Stitch, data transformation tools such as dbt, and reverse ETL tools such as Hightouch.
These data warehouses evolved further to become what we now call a data lakehouse, which brings together the benefits of a modern data warehouse (SQL compatibility and high performance with MPP) with the benefits of a data lake (low cost, limitless storage, and support for different data types).
Into this data lakehouse went all the source data we ingested from our systems and third-party services, becoming our operational data store (ODS). From here, we could join and transform the data and make it available to our EDW, from where it is available for consumption. But the data warehouse was no longer a separate database – it was just a logically separate area of our data lakehouse, using the same technology. This reduced the effort and costs of the transforms and further increased the accessibility of the data.
The following diagram shows the reference architecture of the modern data stack, with the data lakehouse in the center:
Figure 1.3 – The modern data stack architecture
This architecture gives us more options to ingest the source data, and one of those is using change data capture (CDC) tooling, for which we have open source implementations such as Debezium and commercial offerings such as Striim and Google Cloud Datastream, as well as in-depth write-ups on closed source solutions at organizations including Airbnb (https://medium.com/airbnb-engineering/capturing-data-evolution-in-a-service-oriented-architecture-72f7c643ee6f) and Netflix (https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b). CDC tools connect to the transactional databases of your upstream servers and capture all the changes that happen to each of the tables (i.e., the INSERT
, UPDATE
, and DELETE
statements run against the database). These are sent to the data lakehouse, and from there, you can recreate the database in the lakehouse with the same structure and the same data.
However, this creates a tight coupling between the internal models of the upstream service and database and the data consumers. As that service naturally evolves over time, breaking changes will be made to those models. When these happen – often without any notice – they impact the CDC service and/or downstream data uses, leading to instability and unreliability. This makes it impossible to build on this data with any confidence.
The data is also not structured well for analytical queries and uses. It has been designed to meet the needs of the service and to be optimal for a transactional database, not a data lakehouse. It can take a lot of transformation and joining to take this data and produce something that meets the requirements of your downstream users, which is time-consuming and expensive.
There is often little or no documentation for this data, and so to make use of it you need to have in-depth knowledge of those source systems and the way they model the data, including the history of how that has evolved over time. This typically comes from asking teams who work on that service or relying on institutional knowledge from colleagues who have worked with that data before. This makes it difficult to discover new or useful datasets, or for a new consumer to get started.
The root cause of all these problems is that this data was not built for consumption.
Many of these same problems apply to data ingested from a third-party service through an ELT tool such as Fivetran or Stitch. This is particularly true if you’re ingesting from a complex service such as Salesforce, which is highly customizable with custom objects and fields. The data is in a raw form that mimics the API of the third-party service, lacks documentation, and requires in-depth knowledge of the service to use. Like with CDC, it can still change without notice and requires a lot of transformation to produce something that meets your requirements.
One purported benefit of the modern data stack is that we now have more data available to us than ever before. However, a 2022 report from Seagate (https://www.seagate.com/gb/en/our-story/rethink-data/) found that 68% of the data available to organizations goes unused. We still have our dark data problem from the big data era.
The introduction of dbt and similar tools that run on a data lakehouse has made it easier than ever to process this data using just SQL – one of the most well-known and popular languages around. This should increase the accessibility of the data in the data lakehouse.
However, due to the complexity of the transforms required to make use of this data and the domain knowledge you must build up, we still often end up with a central team of data engineers to build and maintain the hundreds, thousands, or even tens of thousands of models required to produce data that is ready for consumption by other data practitioners and users.
Note
We’ll return to our example for the final time to illustrate how different roles work together with this architecture.
Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She may or may not be aware that the data from that database is extracted in a raw form through a CDC service. Certainly, she doesn’t know why.
Ben is a data platform engineer who works on the CDC pipeline. He aims to extract as much of the data as possible into the data lakehouse. He doesn’t know much about the data itself, or what it will be used for. He spends a lot of time dealing with changing schemas that break his pipelines.
Leah is an analytics engineer building dbt pipelines. She takes requirements from data analysts and builds datasets to meet those requirements. She struggles to find the data she wants and needs to learn a lot about the upstream services and their data models in order to produce what she hopes is the right data. These dbt pipelines now number in the thousands and no one has all the context required to debug them all. The pipelines break regularly, and those breakages often have a wide impact.
The BI analyst, Bukayo, takes this data and creates reports to support the business. They often break due to an issue upstream. There are no expectations defined at any of these steps, and therefore no guarantees on the reliability or correctness of the data can be provided to those consuming Bukayo’s data.
The data generator, Vivianne, is far away from the data consumer, Bukayo, and there is no communication. Vivianne has no understanding or visibility of how the changes she makes affect key business processes.
While Bukayo and his peers can usually get the data they need prioritized by Leah and Ben, those who are not BI analysts and want data for other needs have access to the data in a structured form, but lack the domain knowledge to use it effectively. They lack the autonomy to ask for the data they need to meet their requirements.
So, despite the improvements in the technology and architecture over three generations of data platform architectures, we still have that bottleneck of a central team with a long backlog of datasets to make available to the organization before we can start using it to drive business value.
The following diagram shows the three generations side by side, with the same bottleneck highlighted in each:
Figure 1.4 – Comparing the three generations of data platform architectures
It’s that bottleneck that has led us to the state of today’s data platforms and the trouble many of us face when trying to generate business value from our data. In the next section, we’re going to discuss the problems we have when we build data platforms on this architecture.