You're reading from Data Observability for Data Engineering Proactive strategies for ensuring data accuracy and addressing broken data pipelines

Product type Paperback

Published in Dec 2023

Publisher Packt

ISBN-13 9781804616024

Length 228 pages

Edition 1st Edition

Languages

Python

Tools

SQL Server

Concepts

Data Engineering

Authors (2):

Michele Pinto

Sammy El Khammal

Table of Contents (17) Chapters

Preface

1. Part 1: Introduction to Data Observability

2. Chapter 1: Fundamentals of Data Quality Monitoring

3. Chapter 2: Fundamentals of Data Observability

4. Part 2: Implementing Data Observability

5. Chapter 3: Data Observability Techniques

6. Chapter 4: Data Observability Elements

7. Chapter 5: Defining Rules on Indicators

8. Part 3: How to adopt Data Observability in your organization

9. Chapter 6: Root Cause Analysis

10. Chapter 7: Optimizing Data Pipelines

11. Chapter 8: Organizing Data Teams and Measuring the Success of Data Observability

12. Part 4: Appendix

13. Chapter 9: Data Observability Checklist

14. Chapter 10: Pathway to Data Observability

15. Index

16. Other Books You May Enjoy

Identifying information bias in data

Let’s talk about a sneaky problem in the world of data: information bias. This bias arises from a misalignment between data producers and consumers. When the expectations and understanding of data quality are not in sync, information bias manifests, distorting the data’s reliability and integrity. This section will unpack the concept of information bias in the context of data quality, exploring how discrepancies in producers’ and consumers’ perspectives can skew the data landscape. By delving into the roles, relationships, and responsibilities of these key stakeholders, we’ll shed light on the technical intricacies that underpin a successful data-driven ecosystem.

Data is a primary asset of a company’s intelligence. It allows companies to get insights, drive projects, and generate value. At the genesis of all data-driven projects, there is a business need:

Creating a sales report to evaluate top-performing employees
Evaluating the churn of young customers to optimize marketing efforts
Forecasting tire sales to avoid overstocking

These projects rely on a data pipeline, a succession of applications that manipulate raw data to create the final output, often in the form of a report:

Figure 1.2 – Example of a data pipeline

In each case, the produced data serves the interests of a consumer, which can be, among others, a manager, an analyst, or a decision-maker. In a data pipeline, the applications or processes, such as Flink or Power BI in Figure 1.2, consume and produce data sources, such as JSON files or SQL databases.

There are several stakeholders in a pipeline at each step or application: the producers on one hand and the consumers on the other hand. Let’s look at these stakeholders in detail.

Data producers

The producer creates the data and makes it available to other stakeholders. By definition, a producer is not the final user of the data. It can be a data engineering team serving the data science team, an analyst serving the board of managers, or a cross-functional team that produces data products available for the organization. In our pipeline, for instance, an engineer coding the Spark ingestion job is a producer.

As a data producer, you are responsible for the content you serve, and you are concerned about maintaining the right level of service for your consumers. Data producers also need to create new projects to fulfill as many needs as possible from various teams, so they must balance maintaining quality in existing projects with delivering new ones.

As a data producer, you have to maintain a high level of service. This can be achieved by doing the following:

Defining clear data quality targets: Understand what is required to maintain high quality, and communicate those standards to all the data source stakeholders
Ensuring those targets are met thanks to a robust validation process: Put the quality targets into practice and verify the quality of the data, from extraction to transformation and delivery
Keeping accurate and up-to-date data documentation: Document how the process modified the data with instruments such as data lineage and metrics
Collaborating with the data consumers: Ensure you set the right quality standards so that you can correctly maintain them and adapt to evolving needs
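
The first two points above, defining quality targets and verifying them, can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the `QualityTarget` class, the helper functions, and the sample sales rows are all hypothetical names and values introduced here, and a real pipeline would typically rely on a dedicated validation tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class QualityTarget:
    """A named, checkable expectation about a dataset (hypothetical helper)."""
    name: str
    check: Callable[[list[dict]], bool]


def never_missing(column: str) -> Callable[[list[dict]], bool]:
    # True when every row has a non-null value in the column.
    return lambda rows: all(row.get(column) is not None for row in rows)


def non_negative(column: str) -> Callable[[list[dict]], bool]:
    # True when every non-null value is >= 0; missing values are
    # left to the never_missing target so each target reports one thing.
    return lambda rows: all(
        row[column] >= 0 for row in rows if row.get(column) is not None
    )


def validate(rows: list[dict], targets: list[QualityTarget]) -> dict[str, bool]:
    """Evaluate every target and return a name -> passed mapping."""
    return {target.name: target.check(rows) for target in targets}


# Made-up sample data standing in for the output of a producer's job.
sales_rows = [
    {"employee": "E-01", "amount": 120.0},
    {"employee": "E-02", "amount": None},
    {"employee": "E-03", "amount": 75.5},
]

targets = [
    QualityTarget("employee is never missing", never_missing("employee")),
    QualityTarget("amount is never missing", never_missing("amount")),
    QualityTarget("amount is non-negative", non_negative("amount")),
]

report = validate(sales_rows, targets)
failed = [name for name, passed in report.items() if not passed]
print(failed)
```

Keeping each target as a named object makes it easier to communicate the same list to consumers, which ties the validation step back to the collaboration point.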

We emphasize that collaboration with consumers is key to fulfilling the producer’s responsibilities. Let’s see the other end of the value chain: data consumers.

Data consumers

The consumer uses the data created by one or several producers. It can be a user, a team, or another application. The consumer may be the final user of the data, or may in turn be the producer of another dataset. This means that a consumer can become a producer and vice versa. Here are some examples of consumers:

Data or business analysts: They use data produced by the producers to extract insights that will support business decisions
Business managers: They use the data to make (strategic) decisions or follow indicators
Other producers: The consumer is a new intermediary in the data value chain who uses produced data to create new datasets

A consumer needs correct data, and this is where data quality enters the picture. As a consumer, you are dependent on the job done by the producers, especially because your inputs, whether they need to feed another application or a business report, depend directly on the outputs of the producers. Let’s look at the different interactions among the stakeholders.

The relationship between producers and consumers

Producers and consumers are interdependent. Consumers need the raw materials from the producers as inputs and can, in turn, create inputs for other consumers.

Besides this, a producer can have several dependent consumers. For instance, the provider of a data lake will create data that will be the backbone of multiple projects in the data science team.

Conversely, a consumer can use various producers’ data to create their data product. Take the example of a churn model, a machine learning project that aims to identify the customers who are about to terminate their contract. Such a model will use the company’s Customer Relationship Management (CRM) data, but will also rely on external sources such as the International Monetary Fund (IMF) to extract the GDP per capita.
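
To make this dependency on several producers concrete, here is a small Python sketch of a consumer that enriches CRM records with GDP per capita from a second source. Everything in it is an assumption made for illustration: the field names, the `build_features` function, the fictional country codes, and the placeholder GDP values are not real data and do not come from the CRM or the IMF.

```python
# Hypothetical CRM extract (first producer); values are made up.
crm_customers = [
    {"customer_id": 1, "country": "AA", "months_active": 14},
    {"customer_id": 2, "country": "BB", "months_active": 3},
    {"customer_id": 3, "country": "CC", "months_active": 27},
]

# Hypothetical external extract (second producer); placeholder values only.
gdp_per_capita = {"AA": 1000.0, "BB": 2000.0}


def build_features(customers: list[dict], gdp: dict[str, float]):
    """Join both sources and report customers the join could not enrich."""
    features, unmatched_ids = [], []
    for customer in customers:
        value = gdp.get(customer["country"])
        if value is None:
            # The external source has no entry for this country, so the
            # consumer's own output is affected by an upstream gap.
            unmatched_ids.append(customer["customer_id"])
            continue
        features.append({**customer, "gdp_per_capita": value})
    return features, unmatched_ids


features, unmatched_ids = build_features(crm_customers, gdp_per_capita)
print(len(features), unmatched_ids)
```

The consumer's feature table is only as complete as the least complete of its inputs: a gap in either producer's data surfaces as a missing row here.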

In a data pipeline, these two roles alternate, and a consumer can easily become a producer (see Figure 1.3). In well-structured data-driven companies, it is often the case that one team is responsible for collecting data, another team ingests it into the master data, and a data analyst team uses it to create reports. The following figure depicts a data pipeline where consumers and producers are the stakeholders:

Figure 1.3 – A view of a pipeline as a succession of producers and consumers

As you can see, a pipeline can become complex in terms of responsibilities when several stakeholders are involved.
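
One way to reason about these alternating roles is to describe each application by what it reads and what it writes, then derive who produces and who consumes each data source. The following Python sketch does this for a three-step pipeline. The application and data source names are hypothetical and are not taken from Figure 1.2 or Figure 1.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Application:
    """A pipeline step described by the data sources it reads and writes."""
    name: str
    reads: tuple[str, ...]
    writes: tuple[str, ...]


# Hypothetical pipeline: ingestion -> aggregation -> report.
pipeline = [
    Application("ingestion_job", reads=("raw_events.json",), writes=("sales_table",)),
    Application("aggregation_job", reads=("sales_table",), writes=("sales_summary",)),
    Application("sales_report", reads=("sales_summary",), writes=()),
]


def roles_by_source(apps: list[Application]):
    """Map each data source to the applications producing and consuming it."""
    producers: dict[str, list[str]] = {}
    consumers: dict[str, list[str]] = {}
    for app in apps:
        for source in app.writes:
            producers.setdefault(source, []).append(app.name)
        for source in app.reads:
            consumers.setdefault(source, []).append(app.name)
    return producers, consumers


def intermediaries(apps: list[Application]) -> list[str]:
    """Applications that consume a source produced inside the pipeline
    and also produce a source of their own (both roles at once)."""
    produced = {source for app in apps for source in app.writes}
    return [
        app.name
        for app in apps
        if app.writes and any(source in produced for source in app.reads)
    ]


producers, consumers = roles_by_source(pipeline)
print(producers["sales_table"], consumers["sales_table"])
print(intermediaries(pipeline))
```

Here `aggregation_job` is a consumer of `sales_table` and a producer of `sales_summary`, which is the kind of dual role the figure describes.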

For a consumer, the quality of the data is key. It is about getting the right parts at the right time, as on an assembly line. If you work in a car manufacturing company, you don’t expect to receive flat tires when it’s your turn to work on the car (and it is even worse if the line is late or completely stopped). You can, of course, check the quality of the tires yourself when you receive them from the tire manufacturer, before putting them on the car. Nevertheless, at this stage, the issue is detected too late, as it will still slow down car production.

In these interconnected pipelines, issues arise when the quality of the data doesn’t meet the consumers’ expectations. It can be even worse if those issues are detected too late, once a decision has already been taken, leading to disastrous business impacts. As a result, the consumers’ trust in the whole data pipeline is eroded, and the data producer becomes more hesitant to deploy new data applications to production, significantly lengthening the time to market.

In a large-scale company, even small quality issues at the beginning of the pipeline can have serious consequences for the outcome. Without a good data quality process, teams lose days or even months firefighting issues. Finding the cause of an issue, or even detecting the issue itself, can be painful. The consumer may detect the issue and come back to the producer, asking them to fix the pipeline as soon as possible, if not immediately. However, without good quality processes, you may spend days analyzing complex data pipelines and asking for permission to read data from other teams.

Asymmetric information among stakeholders

While the goal of each stakeholder is clear (producers want to send the highest quality data to consumers, and consumers want the best quality standard for their data), it is not clear who is responsible for data quality. Consumers expect data to be of good quality, and expect this quality to be ensured and backed up by the producers. Conversely, producers expect their consumers to validate and control the quality of the data they receive. This results in a misalignment of objectives and responsibilities.

This is at the root of what we describe as information bias between producers and consumers: the two parties have asymmetric information about data quality. Asymmetric information is a situation where one party has more or better information than the other, which can lead to an imbalance of power or an unfair advantage. The producer wants to deliver the quality defined by the consumer but first needs to receive well-defined and accurate expectations from them.

The consumer knows the important metrics they want to follow. However, it requires good communication within data teams to ensure these parameters are understood by the producers.

There is also a shared responsibility paradigm: while the data producers bear the responsibility for ensuring quality, the consumers play an important role in providing feedback and setting clear expectations. This shared responsibility is also key to fostering a good data quality culture inside the organization.
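
The gap between what consumers expect and what producers actually verify can be made visible by writing both down and comparing them. The Python sketch below assumes, purely for illustration, that each side keeps a plain list of named expectations; the `Alignment` class, the `compare` function, and the expectation names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Result of comparing consumer expectations with producer checks."""
    agreed: list[str]        # expected by the consumer and checked by the producer
    unverified: list[str]    # expected by the consumer, checked by nobody
    unrequested: list[str]   # checked by the producer, not asked for


def compare(consumer_expectations: set[str], producer_checks: set[str]) -> Alignment:
    # Sorting keeps the output stable, since sets have no defined order.
    return Alignment(
        agreed=sorted(consumer_expectations & producer_checks),
        unverified=sorted(consumer_expectations - producer_checks),
        unrequested=sorted(producer_checks - consumer_expectations),
    )


# Hypothetical expectations stated by a consumer team.
consumer_expectations = {
    "customer_id is unique",
    "signup_date is not in the future",
}

# Hypothetical checks currently run by the producer team.
producer_checks = {
    "customer_id is unique",
    "file is not empty",
}

alignment = compare(consumer_expectations, producer_checks)
print(alignment.unverified)
```

The `unverified` list is where the information asymmetry lives: each entry is a quality assumption the consumer relies on that the producer does not know about or does not check.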

Data quality is paramount because it’s the cornerstone of trust in any data-driven decision-making process. Just like a house needs a solid foundation to stand, decisions need reliable data to be sound. When data quality is compromised, everything built on top of it is at risk.

With that, we have seen why data quality is important and how it can strengthen relationships within a company. Now, we’ll learn what data quality is by exploring the seven dimensions of data quality.