Driving Data Quality with Data Contracts: A comprehensive guide to building reliable, trusted, and effective data platforms

A Brief History of Data Platforms

Before we can appreciate why we need to make a fundamental shift to a data contracts-backed data platform in order to improve the quality of our data, and ultimately the value we can get from that data, we need to understand the problems we are trying to solve. I’ve found the best way to do this is to look back at the recent generations of data architectures. By doing that, we’ll see that despite the vast improvements in the tooling available to us, we’ve been carrying through the same limitations in the architecture. That’s why we continue to struggle with the same old problems.

Despite these challenges, the importance of data continues to grow. As it is used in more and more business-critical applications, we can no longer accept data platforms that are unreliable, untrusted, and ineffective. We must find a better way.

By the end of this chapter, we’ll have explored the three most recent generations of data architectures at a high level, focusing on just the source and ingestion of upstream data, and the consumption of data downstream. We will gain an understanding of their limitations and bottlenecks and why we need to make a change. We’ll then be ready to learn about data contracts.

In this chapter, we’re going to cover the following main topics:

  • The enterprise data warehouse
  • The big data platform
  • The modern data stack
  • The state of today’s data platforms
  • The ever-increasing use of data in business-critical applications

The enterprise data warehouse

We’ll start by looking at the data architecture that was prevalent in the late 1990s and early 2000s, which was centered around an enterprise data warehouse (EDW). As we discuss the architecture and its limitations, you’ll start to notice how many of those limitations continue to affect us today, despite over 20 years of advancement in tools and capabilities.

EDW is the collective term for a reporting and analytics solution. You’d typically engage with one or two big vendors who would provide these capabilities for you. It was expensive, and only larger companies could justify the investment.

The architecture was built around a large database in the center. This was likely an Oracle or MS SQL Server database, hosted on-premises (this was before the advent of cloud services). The extract, transform, and load (ETL) process was performed on data from source systems, or more accurately, the underlying database of those systems. That data could then be used to drive reporting and analytics.

The following diagram shows the EDW architecture:

Figure 1.1 – The EDW architecture

Because this ETL ran against the database of the source system, reliability was a problem. It created a load on the database that could negatively impact the performance of the upstream service. That, and the limitations of the technology we were using at the time, meant we could do few transforms on the data.

We also had to update the ETL process as the database schema and the data evolved over time, relying on the data generators to let us know when that happened. Otherwise, the pipeline would fail.
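To make this fragility concrete, here is a minimal sketch in Python. It uses the standard library's sqlite3 module as a stand-in for both the source database and the warehouse, which is an assumption made purely for illustration. The table and column names (orders, amount, fact_orders) are hypothetical.

```python
import sqlite3


def run_etl(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Extract rows from the source system's own table and load them into the warehouse."""
    # The query is written against the source service's internal schema,
    # so the ETL is coupled to column names the service team is free to change.
    rows = source.execute("SELECT id, amount FROM orders").fetchall()
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders (id INTEGER PRIMARY KEY, amount REAL)"
    )
    warehouse.executemany(
        "INSERT OR REPLACE INTO fact_orders (id, amount) VALUES (?, ?)", rows
    )
    warehouse.commit()
    return len(rows)


source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 24.50)])

loaded = run_etl(source, warehouse)  # works while the schema is unchanged

# The service team migrates the table and renames a column
# without telling the ETL developers (hypothetical migration).
source.executescript(
    """
    CREATE TABLE orders_v2 (id INTEGER PRIMARY KEY, total_amount REAL);
    INSERT INTO orders_v2 SELECT id, amount FROM orders;
    DROP TABLE orders;
    ALTER TABLE orders_v2 RENAME TO orders;
    """
)

try:
    run_etl(source, warehouse)
    pipeline_failed = False
except sqlite3.OperationalError:
    # The old column no longer exists, so the extract query fails
    # and no new data reaches the warehouse until someone fixes the ETL.
    pipeline_failed = True
```

The first run succeeds, and the second fails as soon as the column disappears. Nothing in the source system records that the ETL depended on that column.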

Those who owned databases were somewhat aware of the ETL work and the business value it drove. There were few barriers between the data generators and consumers, and communication between them was good.

However, the major limitation of this architecture was the database used for the data warehouse. It was very expensive and, as it was deployed on-premises, was of a fixed size and hard to scale. That created a limit on how much data could be stored there and made available for analytics.

It became the responsibility of the ETL developers to decide what data should be available, depending on the business needs, and to build and maintain that ETL process by getting access to the source systems and their underlying databases.

And so, this is where the bottleneck was. The ETL developers had to control what data went in, and they were the only ones who could make data available in the warehouse. Data would only be made available if it met a strong business need, and that typically meant the only data in the warehouse was data that drove the company KPIs. If you wanted some data to do some analysis and it wasn’t already in there, you had to put a ticket in their backlog and hope for the best. If it did ever get prioritized, it was probably too late for what you wanted it for.

Note

Let’s illustrate how different roles worked together with this architecture with an example.

Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She’s aware that some of the data from that database is extracted by a data analyst, Bukayo, and that is used to drive top-level business KPIs.

Bukayo can’t do much transformation on the data, due to the limitations of the technology and the cost of infrastructure, so the reporting he produces is largely on the raw data.

There are no defined expectations between Vivianne and Bukayo, and Bukayo relies on Vivianne telling him in advance whether there are any changes to the data or the schema.

The extraction is not reliable. The ETL process could affect the performance of the database, and so can be switched off when there is an incident. Schema and data changes are not always known in advance. The downstream database also has limited performance and cannot be easily scaled to deal with an increase in the data or usage.

Both Vivianne and Bukayo lack autonomy. Vivianne can’t change her database schema without getting approval from Bukayo. Bukayo can only get a subset of data, with little say over the format. Furthermore, any potential users downstream of Bukayo can only access the data he has extracted, severely limiting the accessibility of the organization’s data.

This won’t be the last time we see a bottleneck that prevents access to, and the use of, quality data. Let’s look now at the next generation of data architecture and the introduction of big data, which was made possible by the release of Apache Hadoop in 2006.

The big data platform

As the internet took off in the 1990s and the size and importance of data grew with it, the big tech companies started developing a new generation of data tooling and architectures that aimed to reduce the cost of storing and transforming vast quantities of data. In 2003, Google wrote a paper describing their Google File System, and in 2004 followed that up with another paper, titled MapReduce: Simplified Data Processing on Large Clusters. These ideas were then implemented at Yahoo! and open sourced as Apache Hadoop in 2006.

Apache Hadoop contained two core modules. The Hadoop Distributed File System (HDFS) gave us the ability to store almost limitless amounts of data reliably and efficiently on commodity hardware. The MapReduce engine then gave us a model on which we could implement programs to process and transform this data, at scale, also on commodity hardware.
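The MapReduce programming model itself is small, and a single-process sketch in Python can show its shape. This is not how Hadoop is implemented: Hadoop distributes the map and reduce tasks across many machines and reads its input from HDFS. The function names here (map_reduce, count_words_mapper, sum_reducer) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, list[int]], int],
) -> dict[str, int]:
    """Run the map, shuffle, and reduce phases in a single process."""
    # Map phase: each input record is turned into zero or more (key, value) pairs.
    # Shuffle phase: the pairs are grouped by key.
    grouped: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    # Reduce phase: all the values for one key are combined into a single result.
    return {key: reducer(key, values) for key, values in grouped.items()}


def count_words_mapper(line: str) -> Iterator[tuple[str, int]]:
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key: str, values: list[int]) -> int:
    """Add up the counts emitted for one word."""
    return sum(values)


lines = ["big data big clusters", "commodity hardware", "big hardware"]
word_counts = map_reduce(lines, count_words_mapper, sum_reducer)
```

Because each mapper call sees only one record and each reducer call sees only one key, the work can be split across machines, which is what allowed this model to scale on commodity hardware.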

This led to the popularization of big data, which was the collective term for our reporting, ML, and analytics capabilities with HDFS and MapReduce as the foundation. These platforms used open source technology and could be on-premises or in the cloud. The reduced costs made this accessible to organizations of any size, who could either implement it themselves or use a packaged enterprise solution provided by the likes of Cloudera and MapR.

The following diagram shows the reference data platform architecture built upon Hadoop:

Figure 1.2 – The big data platform architecture

At the center of the architecture is the data lake, implemented on top of HDFS or a similar filesystem. Here, we could store an almost unlimited amount of semi-structured or unstructured data. This still needed to be put into an EDW in order to drive analytics, as data visualization tools such as Tableau needed a SQL-compatible database to connect to.

Because there were no expectations set on the structure of the data in the data lake, and no limits on the amount of data, it was very easy to write as much as you could and worry about how to use it later. This led to the concept of extract, load, and transform (ELT), as opposed to ETL, where the idea was to extract and load the data into the data lake first without any processing, then apply schemas and transforms later as part of loading to the data warehouse or reading the data in other downstream processes.
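The following Python sketch illustrates this load-first approach, where a schema is applied only when the data is read. The in-memory list standing in for the data lake, the field names, and the read_payments function are all assumptions made for illustration.

```python
import json

# Stand-in for files in the data lake: raw payloads are stored exactly as received.
data_lake: list[str] = []


def load_raw(payload: str) -> None:
    """Extract and load: append the payload without validating or transforming it."""
    data_lake.append(payload)


load_raw('{"user_id": 1, "amount": "9.99", "currency": "EUR"}')
load_raw('{"user_id": 2, "amount": "24.50"}')  # currency is missing
load_raw('{"userId": 3, "amount": "5.00", "currency": "EUR"}')  # field was renamed


def read_payments(raw_records: list[str]) -> tuple[list[dict], list[str]]:
    """Transform: apply a schema at read time, long after the data was written."""
    valid: list[dict] = []
    rejected: list[str] = []
    for raw in raw_records:
        record = json.loads(raw)
        try:
            valid.append(
                {
                    "user_id": int(record["user_id"]),
                    "amount": float(record["amount"]),
                    "currency": record.get("currency", "UNKNOWN"),
                }
            )
        except (KeyError, ValueError):
            # The writer was never held to a schema, so the reader has to cope.
            rejected.append(raw)
    return valid, rejected


payments, rejected = read_payments(data_lake)
```

Writing is trivially easy, but every reader has to discover the structure of the data for themselves and decide what to do with records that do not fit.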

We then had much more data than ever before. With a low barrier to entry and cheap storage, data was easily added to the data lake, whether there was a consumer requirement in mind or not.

However, in practice, much of that data was never used. For a start, it was almost impossible to know what data was in there and how it was structured. It lacked any documentation, had no set expectations on its reliability and quality, and no governance over how it was managed. Then, once you did find some data you wanted to use, you needed to write MapReduce jobs using Hadoop or, later, Apache Spark. But this was very difficult to do – particularly at any scale – and only achievable by a large team of specialist data engineers. Even then, those jobs tended to be unreliable and have unpredictable performance.

This is why we started hearing people refer to it as the data swamp. While much of the data was likely valuable, the inaccessibility of the data lake meant it was never used. Gartner introduced the term dark data to describe this, where data is collected and never used, and the costs of storing and managing that data outweigh any value gained from it (https://www.gartner.com/en/information-technology/glossary/dark-data). In 2015, IDC estimated 90% of unstructured data could be considered dark (https://www.kdnuggets.com/2015/11/importance-dark-data-big-data-world.html).

Another consequence of this architecture was that it moved the end data consumers further away from the data generators. Typically, a central data engineering team was introduced to focus solely on ingesting the data into the data lake, building the tools and the connections required to do that from as many source systems as possible. They were the ones interacting with the data generators, not the ultimate consumers of the data.

So, despite the advance in tools and technologies, in practice, we still had many of the same limitations as before. Only a limited amount of data could be made available for analysis and other uses, and we had that same bottleneck controlling what that data was.

Note

Let’s return to our example to illustrate how different roles worked together with this architecture.

Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She may or may not be aware that some of the data from that database is extracted in a raw form, and is unlikely to know exactly what the data is. Certainly, she doesn’t know why.

Ben is a data engineer who works on the ELT pipeline. He aims to extract as much of the data as possible into the data lake. He doesn’t know much about the data itself, or what it will be used for. He spends a lot of time dealing with changing schemas that break his pipelines.

Leah is another data engineer, specializing in writing MapReduce jobs. She takes requirements from data analysts and builds datasets to meet those requirements. She struggles to find the data she wants and needs to learn a lot about the upstream services and their data models in order to produce what she hopes is the right data. These MapReduce jobs have unpredictable performance and are difficult to debug. The jobs do not run reliably.

The BI analyst, Bukayo, takes this data and creates reports to support the business. They often break due to an issue upstream. There are no expectations defined at any of these steps, and therefore no guarantees on the reliability or correctness of the data can be provided to those consuming Bukayo’s data.

The data generator, Vivianne, is far away from the data consumer, Bukayo, and there is no communication. Vivianne has no understanding of how the changes she makes affect key business processes.

While Bukayo and his peers can usually get the data they need prioritized by Leah and Ben, those who are not BI analysts and want data for other needs lack the autonomy and the expertise to access it, preventing the use of data for anything other than the most critical business requirements.

The next generation of data architectures began in 2012 with the launch of Amazon Redshift on AWS and the explosion of tools and investment into what became known as the modern data stack (MDS). In the next section, we’ll explore this architecture and see whether we can finally get rid of this bottleneck.

The modern data stack

Amazon Redshift was the first cloud-native data warehouse and provided a real step-change in capabilities. It had the ability to store almost limitless data at a low cost in a SQL-compatible database, and the massively parallel processing (MPP) capabilities meant you could process that data effectively and efficiently at scale.

This sounds like what we had with Hadoop, but the key differences were the SQL compatibility and the more strongly defined structure of the data. This made it much more accessible than the unstructured files on an HDFS cluster. It also presented an opportunity to build services on top of Redshift and later SQL-compatible warehouses such as Google BigQuery and Snowflake, which led to an explosion of tools that make up today’s modern data stack. This includes ELT tools such as Fivetran and Stitch, data transformation tools such as dbt, and reverse ETL tools such as Hightouch.

These data warehouses evolved further to become what we now call a data lakehouse, which brings together the benefits of a modern data warehouse (SQL compatibility and high performance with MPP) with the benefits of a data lake (low cost, limitless storage, and support for different data types).

Into this data lakehouse went all the source data we ingested from our systems and third-party services, becoming our operational data store (ODS). From here, we could join and transform the data and make it available to our EDW, from where it was available for consumption. But the data warehouse was no longer a separate database – it was just a logically separate area of our data lakehouse, using the same technology. This reduced the effort and costs of the transforms and further increased the accessibility of the data.

The following diagram shows the reference architecture of the modern data stack, with the data lakehouse in the center:

Figure 1.3 – The modern data stack architecture

This architecture gives us more options to ingest the source data. One of those is change data capture (CDC) tooling, for which we have open source implementations such as Debezium and commercial offerings such as Striim and Google Cloud Datastream, as well as in-depth write-ups on closed source solutions at organizations including Airbnb (https://medium.com/airbnb-engineering/capturing-data-evolution-in-a-service-oriented-architecture-72f7c643ee6f) and Netflix (https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b). CDC tools connect to the transactional databases of your upstream services and capture all the changes that happen to each of the tables (i.e., the INSERT, UPDATE, and DELETE statements run against the database). These are sent to the data lakehouse, and from there, you can recreate the database in the lakehouse with the same structure and the same data.
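As a rough illustration of how a table can be recreated from a stream of changes, here is a minimal Python sketch. The event format is hypothetical and much simpler than what real CDC tools emit; it is not the format used by Debezium or any of the other tools mentioned.

```python
from typing import Any

# Hypothetical change events, in the order they were captured from the source database.
change_events: list[dict[str, Any]] = [
    {"op": "insert", "table": "customers", "key": 1, "row": {"name": "Ada", "plan": "free"}},
    {"op": "insert", "table": "customers", "key": 2, "row": {"name": "Grace", "plan": "pro"}},
    {"op": "update", "table": "customers", "key": 1, "row": {"name": "Ada", "plan": "pro"}},
    {"op": "delete", "table": "customers", "key": 2, "row": None},
]


def apply_changes(events: list[dict[str, Any]]) -> dict[str, dict[int, dict[str, Any]]]:
    """Replay change events in order to rebuild the current state of each table."""
    replica: dict[str, dict[int, dict[str, Any]]] = {}
    for event in events:
        table = replica.setdefault(event["table"], {})
        if event["op"] in ("insert", "update"):
            # Inserts and updates both leave the latest version of the row in place.
            table[event["key"]] = event["row"]
        elif event["op"] == "delete":
            table.pop(event["key"], None)
        else:
            raise ValueError(f"unknown operation: {event['op']}")
    return replica


replica = apply_changes(change_events)
```

The replica mirrors whatever the upstream table looks like, including its column names and internal modeling decisions.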

However, this creates a tight coupling between the internal models of the upstream service and database and the data consumers. As that service naturally evolves over time, breaking changes will be made to those models. When these happen – often without any notice – they impact the CDC service and/or downstream data uses, leading to instability and unreliability. This makes it impossible to build on this data with any confidence.

The data is also not structured well for analytical queries and uses. It has been designed to meet the needs of the service and to be optimal for a transactional database, not a data lakehouse. It can take a lot of transformation and joining to take this data and produce something that meets the requirements of your downstream users, which is time-consuming and expensive.
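Even a simple question can need several joins across a transactional model. The sketch below uses Python's sqlite3 module with a hypothetical three-table schema (customers, orders, order_items) to compute revenue per customer; real service schemas are usually far larger and less tidy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A normalized, transactional model: good for the service, awkward for analytics.
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE order_items (order_id INTEGER, unit_price REAL, quantity INTEGER);

    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
    INSERT INTO order_items VALUES (10, 5.0, 2), (11, 20.0, 1), (12, 7.5, 4);
    """
)

# The consumer has to know how the three tables relate before they can
# answer a question as simple as "how much revenue per customer?".
revenue_per_customer = conn.execute(
    """
    SELECT c.name, SUM(i.unit_price * i.quantity) AS revenue
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    JOIN order_items AS i ON i.order_id = o.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

Every consumer who needs this number has to rediscover these relationships, or rely on someone else having already built and maintained the transform.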

There is often little or no documentation for this data, and so to make use of it you need to have in-depth knowledge of those source systems and the way they model the data, including the history of how that has evolved over time. This typically comes from asking teams who work on that service or relying on institutional knowledge from colleagues who have worked with that data before. This makes it difficult to discover new or useful datasets, or for a new consumer to get started.

The root cause of all these problems is that this data was not built for consumption.

Many of these same problems apply to data ingested from a third-party service through an ELT tool such as Fivetran or Stitch. This is particularly true if you’re ingesting from a complex service such as Salesforce, which is highly customizable with custom objects and fields. The data is in a raw form that mimics the API of the third-party service, lacks documentation, and requires in-depth knowledge of the service to use. Like with CDC, it can still change without notice and requires a lot of transformation to produce something that meets your requirements.

One purported benefit of the modern data stack is that we now have more data available to us than ever before. However, a 2022 report from Seagate (https://www.seagate.com/gb/en/our-story/rethink-data/) found that 68% of the data available to organizations goes unused. We still have our dark data problem from the big data era.

The introduction of dbt and similar tools that run on a data lakehouse has made it easier than ever to process this data using just SQL – one of the most well-known and popular languages around. This should increase the accessibility of the data in the data lakehouse.

However, due to the complexity of the transforms required to make use of this data and the domain knowledge you must build up, we still often end up with a central team of data engineers to build and maintain the hundreds, thousands, or even tens of thousands of models required to produce data that is ready for consumption by other data practitioners and users.

Note

We’ll return to our example for the final time to illustrate how different roles work together with this architecture.

Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She may or may not be aware that the data from that database is extracted in a raw form through a CDC service. Certainly, she doesn’t know why.

Ben is a data platform engineer who works on the CDC pipeline. He aims to extract as much of the data as possible into the data lakehouse. He doesn’t know much about the data itself, or what it will be used for. He spends a lot of time dealing with changing schemas that break his pipelines.

Leah is an analytics engineer building dbt pipelines. She takes requirements from data analysts and builds datasets to meet those requirements. She struggles to find the data she wants and needs to learn a lot about the upstream services and their data models in order to produce what she hopes is the right data. These dbt pipelines now number in the thousands and no one has all the context required to debug them all. The pipelines break regularly, and those breakages often have a wide impact.

The BI analyst, Bukayo, takes this data and creates reports to support the business. They often break due to an issue upstream. There are no expectations defined at any of these steps, and therefore no guarantees on the reliability or correctness of the data can be provided to those consuming Bukayo’s data.

The data generator, Vivianne, is far away from the data consumer, Bukayo, and there is no communication. Vivianne has no understanding or visibility of how the changes she makes affect key business processes.

While Bukayo and his peers can usually get the data they need prioritized by Leah and Ben, those who are not BI analysts and want data for other needs have access to the data in a structured form, but lack the domain knowledge to use it effectively. They lack the autonomy to ask for the data they need to meet their requirements.

So, despite the improvements in the technology and architecture over three generations of data platform architectures, we still have that bottleneck of a central team with a long backlog of datasets to make available to the organization before we can start using it to drive business value.

The following diagram shows the three generations side by side, with the same bottleneck highlighted in each:

Figure 1.4 – Comparing the three generations of data platform architectures

It’s that bottleneck that has led us to the state of today’s data platforms and the trouble many of us face when trying to generate business value from our data. In the next section, we’re going to discuss the problems we have when we build data platforms on this architecture.

The state of today’s data platforms

The limitations of today’s data architectures, and the data culture they reinforce, result in several problems that are felt almost universally by organizations trying to get value from their data. Let’s explore the following problems in turn and the impact they have:

  • The lack of expectations
  • The lack of reliability
  • The lack of autonomy

The lack of expectations

Users working with source data that has been ingested through an ELT or CDC tool can have very few expectations about what the data is, how it should be used, and how reliable it will be. They also don’t know exactly where this data comes from, who generated it, and how it might change in the future.

In the absence of explicitly defined expectations, users tend to make assumptions that are more optimistic than reality, particularly when it comes to the reliability and availability of the data. This only increases the impact when there is a breaking change in the upstream data, or when that data proves to be unreliable.

It also leads to the data not being used correctly. For example, there could be different tables and columns that relate to the various dimensions around how a customer is billed for their use of the company’s products, and this will evolve over time. The data consumer will need to know that in detail if they are to use this data to produce revenue numbers for the organization. They therefore need to gain in-depth knowledge of the service and the logic it uses so they can reimplement that in their ETL.

Successfully building applications and services on top of the data in our lakehouse would require the active transfer of this knowledge from the upstream data generators to the downstream data consumers, including the following:

  • The domain models the dataset describes
  • The change history of the dataset
  • The schemas and metadata

However, due to the distance between these groups, there is no feasible way to establish this exchange.

This lack of expectations, and no requirement to fulfill them, is also a problem for the data generators. Often, they don’t even know they are data generators, as they are just writing data to their internal models in their service’s database or managing a third-party service as best they can to meet their direct users’ requirements. They are completely unaware of the ELT/CDC processes running to extract their data and its importance to the rest of the organization. This makes it difficult to hold them responsible for the changes they make and their downstream impact, as it is completely invisible to them and often completely unexpected. So, the responsibility falls entirely on the data teams attempting to make use of this data.

This lack of responsibility is shown in the following diagram, which is the same one we saw earlier in The modern data stack section, but annotated with responsibility:

Figure 1.5 – Responsibility in the modern data stack

This diagram also illustrates another of the big problems with today’s data platforms, which is the complete lack of collaboration between the data generators and the data consumers. The data generators are far removed from the consumption points and have little to no idea of who is consuming their data, why they need the data, and the important business processes and outcomes that are driven by that data. On the other side, the data consumers don’t even know who is generating the data they depend on so much and have no say in what that data should look like in order to meet their requirements. They simply get the data they are given.

The lack of reliability

Many organizations suffer from unreliable data pipelines and have done so for years. This can come at a significant cost, with a Gartner survey (https://www.gartner.com/smarterwithgartner/how-to-stop-data-quality-undermining-your-business) suggesting that data quality issues cost companies millions of dollars a year.

There are many reasons for this unreliability. It could be the lack of quality of the data when ingested, or how the quality of that data has degraded over time as it becomes stale. Or the data could be late or incomplete.

The root cause of so many of these reliability problems is that we are building on data that was not made for consumption.

As mentioned earlier, data being ingested through ELT and CDC tools can change at any time, without warning. These could be schema changes, which typically cause the downstream pipelines to fail loudly with no new data being ingested or populated until the issue has been resolved. It could also be a change to the data itself, or the logic required to use that data correctly. These are often silent failures and may not be automatically detected. The first time we might hear about the issue is when a user brings up some data, maybe as part of a presentation or a meeting, and notices it doesn’t look quite right or looks different to how it did yesterday.
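The difference between a loud and a silent failure can be sketched in a few lines of Python. The column names and the change of unit from euros to cents are hypothetical examples, not taken from any particular system.

```python
EXPECTED_COLUMNS = {"order_id", "amount"}


def total_revenue(rows: list[dict]) -> float:
    """Sum order amounts, failing loudly if an expected column is missing."""
    for row in rows:
        missing = EXPECTED_COLUMNS - row.keys()
        if missing:
            raise KeyError(f"missing columns: {sorted(missing)}")
    return sum(row["amount"] for row in rows)


yesterday = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 25.0}]
baseline_total = total_revenue(yesterday)

# Schema change: the column is renamed upstream, so the pipeline fails loudly.
renamed = [{"order_id": 1, "amount_eur": 10.0}, {"order_id": 2, "amount_eur": 25.0}]
try:
    total_revenue(renamed)
    failed_loudly = False
except KeyError:
    failed_loudly = True

# Data change: the schema is identical, but amounts are now stored in cents.
# Nothing fails, and the reported total is silently 100 times too large.
in_cents = [{"order_id": 1, "amount": 1000}, {"order_id": 2, "amount": 2500}]
silent_total = total_revenue(in_cents)
```

The loud failure is annoying but visible. The silent one produces a plausible-looking number that only a person familiar with the business is likely to question.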

Often, these changes can’t be fixed in the source system. They were made for a good reason and have already been deployed to production. That leaves the data pipeline authors to implement a fix within the pipeline, which in the best case is just pointing to another column but more likely ends up being yet another CASE statement with logic to handle the change, or another IFNULL statement, or IF DATE < x THEN do this ELSE do that. This builds and builds over time, creating ever more complex and brittle data pipelines, and further increasing their unreliability.
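Here is a small sketch of what that accumulated logic can look like, run through Python's sqlite3 module. The table, the columns, and the cut-off date are hypothetical: the example assumes the upstream service switched from storing amounts in currency units to storing them in cents at the start of 2022, and later started leaving the value empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders "
    "(id INTEGER, order_date TEXT, amount REAL, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [
        (1, "2021-03-01", 10.0, None),  # old rows: amount in currency units
        (2, "2022-06-15", None, 2500),  # after the change: amount_cents is populated
        (3, "2022-07-01", None, None),  # a later change: the value can be missing
    ],
)

# Each upstream change adds another branch to the pipeline's SQL.
query = """
SELECT
    id,
    CASE
        WHEN order_date < '2022-01-01' THEN amount
        ELSE IFNULL(amount_cents, 0) / 100.0
    END AS amount
FROM raw_orders
ORDER BY id
"""
normalized = conn.execute(query).fetchall()
```

Two upstream changes have already produced a date-dependent branch and a null default. Whether a missing amount should really be treated as zero is exactly the kind of decision the pipeline author ends up making without knowing the intent behind the change.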

All the while, we’re increasing the number of applications built on this data and adding more and more complexity to these pipelines, which again further increases the unreliability.

The cost of these reliability issues is that users lose trust in the data, and once that trust is lost it’s very hard to win back.

The lack of autonomy

For decades we’ve been creating our data platforms with a bottleneck in the middle. That bottleneck is a team, typically a central data engineering or BI engineering team, who are the only ones with the ability and the time to attempt to make use of the raw source data, with everyone else consuming their data.

Anyone wanting to have data made available to them will be waiting for that central team to prioritize that ask, with their ticket sitting in a backlog. These central teams will never have the capacity to keep up with these requests and instead can only focus on those deemed the highest priority, which are typically those data sources that drive the company KPIs and other top-level metrics.

That’s not to say the rest of the data does not have value! As we’ll discuss in the following section, it does, and there will be plenty of ways that data could be used to drive decisions or improve data-driven products across the organization. But this data is simply not accessible enough to the people who could make use of this data and therefore sits unused.

To empower a truly data-driven organization, we need to move away from the dependence on a central and limited data engineering team to an architecture that promotes autonomy, opening that dark data up to uses that will never be important enough to prioritize, but that when added up provide a lot of business value to the organization and support new applications that could be critical for its success.

This isn’t a technical limitation. Modern data lakehouses can be queried by anyone who knows SQL, and any data available in the lakehouse can be made available to any reporting tool for use by less technical users. It’s a limitation of the way we have chosen to ingest data through ELT, the lack of quality of that data, and the data culture that this embodies.

As we’ll discuss in the next section, organizations are looking to gain a competitive advantage with the ever-increasing use of data in more and more business-critical applications. These limitations in our data architecture are no longer acceptable.

The ever-increasing use of data in business-critical applications

Despite all these challenges, data produced on a data platform is being increasingly used in business-critical applications.

This is for good reason! It’s well accepted that organizations that make effective use of data can gain a real competitive advantage. Increasingly, these are not traditional tech companies but organizations across almost all industries, as technology and data become more important to their business. This has led to organizations investing heavily in areas such as data science, looking to gain similar competitive advantages (or at least, not get left behind!).

However, for these data projects to be successful, more of our data needs to be accessible to people across the organization. We can no longer just be using a small percentage of our data to provide top-level business metrics and nothing more.

This can be clearly seen in the consumer sector, where to be competitive you must be providing a state-of-the-art customer experience, and that requires the atomic use of data at every customer touchpoint. A report from McKinsey (https://www.mckinsey.com/industries/retail/our-insights/jumpstarting-value-creation-with-data-and-analytics-in-fashion-and-luxury) reported that the 25 top-performing retailers were digital leaders that were 83% more profitable and took over 90% of the sector’s gains in market capitalization.

Many organizations are, of course, aware of this. An industry report by Anmut in 2021 (https://www.anmut.co.uk/wp-content/uploads/2021/05/Amnut-DLR-May2021.pdf) illustrated both the perceived importance of data to organizations and the problems they have utilizing it when it stated this in its executive summary:

We found that 91% of business leaders say data’s critical to their business success, 76% are investing in business transformation around data, and two-thirds of boards say data is a material asset.

Yet, just 34% of businesses manage data assets with the same discipline as other assets, and these businesses are reaping the rewards. This 34% spend most of their data investment creating value, while the rest spend nearly half of their budget fixing data.

It’s this lack of discipline in managing their data assets that is really harming organizations. It manifests itself in the lack of expectations throughout the pipeline, which then permeates the entire data platform and reaches the datasets within the data warehouse, which themselves have ill-defined expectations for their downstream users or data-driven products.

The following diagram shows a typical data pipeline and how at each stage the lack of defined expectations ultimately results in the consumers losing trust in business-critical data-driven products:

Figure 1.6 – The lack of expectations throughout the data platform

Again, in the absence of these expectations, users will optimistically assume the data is more reliable than it is, but now it’s not just internal KPIs and reporting that are affected by the inevitable downtime but revenue-generating services affecting external customers. Just like internal users, they will start losing trust, but this time they are losing trust in the product and the company, which can eventually cause real damage to the company’s brand and reputation.

As the importance of data continues to increase and it finds its way into more business-critical applications, it becomes imperative that we greatly increase the reliability of our data platforms to meet the expectations of our users.

Summary

There’s no doubt that the effective use of data is becoming ever more critical to organizations. No longer is data expected only to drive internal reporting and KPIs; it now drives key products, both internal ones and those used by external customers.

However, while the tools we have available are better than ever, the architecture of the data platforms that underpin all of this has not evolved alongside them. Our data platforms continue to be hampered by a bottleneck that restricts the accessibility of the data. They are unable to provide reliable, quality data to the teams who need it, when they need it.

We need to stop working around these problems within the data platform and address them at the source.

We need an architecture that sets expectations around what data is provided, how to use it, and how reliable it will be.

We need a data culture that treats data as a first-class citizen, where responsibility is assigned to those who generate the data.

And so, in the next chapter, we’ll introduce data contracts, a new architecture pattern designed to solve these problems, and provide the foundations we need to empower true data-driven organizations that realize the value of their data.
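
As a rough preview of what "setting expectations" could look like in practice, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `DataContract` class, the `violations` function, the `orders` dataset, and the `checkout-team` owner are all hypothetical names invented for this example, not a real library or a definitive implementation. It records what data is provided (a schema), how to use it (a description), how reliable it should be (a freshness expectation), and who is responsible for it (an owner).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical, minimal set of expectations for one dataset."""
    name: str
    owner: str                     # team that generates the data and is responsible for it
    schema: dict                   # what data is provided: field name -> expected Python type
    description: str = ""          # how the data is meant to be used
    max_staleness_hours: int = 24  # assumed freshness expectation (not enforced in this sketch)


def violations(contract: DataContract, record: dict) -> list:
    """Return a list of human-readable schema problems for a single record."""
    problems = []
    for field_name, expected_type in contract.schema.items():
        if field_name not in record:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(record[field_name], expected_type):
            problems.append(
                f"wrong type for {field_name}: expected {expected_type.__name__}"
            )
    return problems


# Illustrative contract and records (all values are made up).
orders_contract = DataContract(
    name="orders",
    owner="checkout-team",
    schema={"order_id": str, "amount_cents": int},
    description="One row per completed order.",
)

good_record = {"order_id": "A-1", "amount_cents": 1299}
bad_record = {"order_id": 42}

print(violations(orders_contract, good_record))
print(violations(orders_contract, bad_record))
```

The point of the sketch is that the expectations live alongside the data and are owned by the team that generates it, so a consumer can check a record against them rather than assuming the data is reliable.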

Key benefits

  • Understand data contracts and their power to resolve the problems in contemporary data platforms
  • Learn how to design and implement a cutting-edge data platform powered by data contracts
  • Access practical guidance from the pioneer of data contracts to get expert insights on effective utilization

Description

Despite the passage of time and the evolution of technology and architecture, the challenges we face in building data platforms persist. Our data often remains unreliable, lacks trust, and fails to deliver the promised value. With Driving Data Quality with Data Contracts, you’ll discover the potential of data contracts to transform how you build your data platforms, finally overcoming these enduring problems. You’ll learn how establishing contracts as the interface allows you to explicitly assign responsibility and accountability of the data to those who know it best—the data generators—and give them the autonomy to generate and manage data as required. The book will show you how data contracts ensure that consumers get quality data with clearly defined expectations, enabling them to build on that data with confidence to deliver valuable analytics, performant ML models, and trusted data-driven products. By the end of this book, you’ll have gained a comprehensive understanding of how data contracts can revolutionize your organization’s data culture and provide a competitive advantage by unlocking the real value within your data.

Who is this book for?

If you’re a data engineer, data leader, architect, or practitioner thinking about your data architecture and looking to design one that enables your organization to get the most value from your data, this book is for you. Staff engineers, product managers, and software engineering leaders and executives will also find valuable insights.

What you will learn

  • Gain insights into the intricacies and shortcomings of today's data architectures
  • Understand exactly how data contracts can solve prevalent data challenges
  • Drive a fundamental transformation of your data culture by implementing data contracts
  • Discover what goes into a data contract and why it's important
  • Design a modern data architecture that leverages the power of data contracts
  • Explore sample implementations to get practical knowledge of using data contracts
  • Embrace best practices for the successful deployment of data contracts

Product Details

Publication date: Jun 30, 2023
Length: 206 pages
Edition: 1st
Language: English
ISBN-13: 9781837636242

Table of Contents

15 Chapters
Part 1: Why Data Contracts?
Chapter 1: A Brief History of Data Platforms
Chapter 2: Introducing Data Contracts
Part 2: Driving Data Culture Change with Data Contracts
Chapter 3: How to Get Adoption in Your Organization
Chapter 4: Bringing Data Consumers and Generators Closer Together
Chapter 5: Embedding Data Governance
Part 3: Designing and Implementing a Data Architecture Based on Data Contracts
Chapter 6: What Makes Up a Data Contract
Chapter 7: A Contract-Driven Data Architecture
Chapter 8: A Sample Implementation
Chapter 9: Implementing Data Contracts in Your Organization
Chapter 10: Data Contracts in Practice
Index
Other Books You May Enjoy

Customer reviews

Rating distribution
4.8 out of 5 (11 Ratings)
5 star 81.8%
4 star 18.2%
3 star 0%
2 star 0%
1 star 0%

Top Reviews

Camille, Jul 12, 2024
Rating: 5 out of 5
Experiencing a "Where has this book been my whole life" kind of moment reading this book.
Amazon Verified review

Gordon W. Hamilton, Aug 29, 2023
Rating: 5 out of 5
This is an important book on how to practically achieve Data Quality for Data Engineers and Architects. There are major tides and currents roiling the data world and Andrew's pragmatic descriptions of cascading Data Contracts help connect the dots between Data Mesh's Data Products, Data Observability, Data Governance, and Data Quality. Sometimes it is hard to see the forest for the burning DQ leaves - this book helps give you the perspective you need.
Amazon Verified review

Peter O'Kelly, Sep 22, 2023
Rating: 5 out of 5
Driving Data Quality with Data Contracts lives up to its ambitious subtitle as "A comprehensive guide to building reliable, trusted, and effective data platforms". Metaplane CEO Kevin Hu notes, in his foreword: "Whether you’re a data practitioner who is tired of being blamed for data quality issues or a business stakeholder who wants to promote data trust, this book is the gold standard for learning about data contracts." Author Andrew Jones, credited with coining the term "data contract", addresses both technical and social/organizational aspects of data contracts and also includes a sample implementation. The book starts with a brief history of data platforms, explaining how the need for data contracts evolved, and continues to highlight how data contracts fit into other recent concepts such as data products and data mesh.
Amazon Verified review

Mojeed Abisiga, Aug 15, 2023
Rating: 5 out of 5
This book is a game-changer. Andrew Jones really simplified complex concepts, unveiled the power of Data Contracts, Observability, and Governance. A must-read for anyone serious about getting quality data and successful insights. Kudos to the author for demystifying data excellence.
Amazon Verified review

E. Stratton, Jan 30, 2024
Rating: 5 out of 5
In a world where so much innovation and new technology and approaches are actually rehashes of things that have already come and gone, data contracts is a genuinely problem-solving and transformative technical and architectural paradigm for the management, enrichment, use, and documentation of business data. The book explains it clearly and approachably, and gives excellent historical insight into the evolution of data analytics over the past 20 years and why it has led to the need for data contracts.
Amazon Verified review