Data Observability for Data Engineering

Fundamentals of Data Quality Monitoring

Welcome to the exciting world of Data Observability for Data Engineering!

As you open the pages of this book, you will embark on a journey that will immerse you in data observability. The knowledge within this book is designed to equip you, as a data engineer, data architect, data product owner, or data engineering manager, with the skills and tools necessary to implement best practices in your data pipelines.

In this book, you will learn how data observability can help you build trust in your organization. Observability provides insights directly from within the process, offering a fresh approach to monitoring. It’s a method for determining whether the pipeline is functioning properly, especially in terms of adhering to its data quality standards.

Let’s get real for a moment. In our world, where we’re swimming in data, it’s easy to feel like we’re drowning. Data observability isn’t just some fancy term – it’s your life raft. Without it, you’re flying blind, making decisions based on guesswork. Who wants to be in that hot seat when data disasters strike? Not you.

This book isn’t just another item on your reading list; it’s the missing piece in your data puzzle. It’s about giving you the superpower to spot the small issues in your data before they turn into full-blown catastrophes. Think about the cost, not just in dollars, but in sleepless nights and lost trust, when data incidents occur. Scary, right?

But here’s the kicker: data observability isn’t just about avoiding nightmares; it’s about building a foundation of trust. When your data’s in check, your team can make bold, confident decisions without that nagging doubt. That’s priceless.

Data observability is not just a buzzword – we are deeply convinced it is the backbone of any resilient, efficient, and reliable data pipeline. This book will take you on a comprehensive exploration of the core principles of data observability, the techniques you can use to develop an observability approach, the challenges faced when implementing it, and the best practices being employed by industry leaders. This book will be your compass in the vast universe of data observability by providing you with various examples that allow you to bridge the gap between theory and practice.

The knowledge in this book is organized into four essential parts. In part one, we will lay the foundation by introducing the fundamentals of data quality monitoring and how data observability takes it to the next level. This crucial groundwork will ensure you understand the core concepts and will set the stage for the next topics.

In part two, we will move on to the practical aspects of implementing data observability. You will dive into various techniques and elements of observability and learn how to define rules on indicators. This part will provide you with the skills to apply data observability in your projects.

The third part will focus on adopting data observability at scale in your organization. You will discover the main benefits of data observability by learning how to conduct root cause analysis, how to optimize pipelines, and how to foster a culture change within your team. This part is essential to ensure the successful implementation of a data observability program.

Finally, the fourth part will contain additional resources focused on data engineering, such as a data observability checklist and a technical roadmap to implement it, leaving you with strong takeaways so that you can stand on your own two feet.

Let’s start with a hypothetical scenario. You are a data engineer, coming back from your holidays and ready to start the quarter. You have a lot of new projects for the year. However, the second you reach your desktop, Lucy from the marketing team calls out to you: “The marketing report of last month is totally wrong – please fix it ASAP. I need to update my presentation!”

This is annoying; all the work that’s been scheduled for the day is delayed, and you need to check the numbers. You open your Tableau dashboard and start a Zoom meeting with the marketing team. The first task of the day: understand what she meant by wrong. Indeed, the turnover seems odd. It’s time for you to have a look at the SQL database feeding the dashboard. Again, you see the same issue. This is strange and will require even more investigation.

After hours of manual and tedious checks, contacting three different teams and sending 12 emails, you finally found the culprit: an ingestion script, feeding the company’s master database, was modified to express the turnover in thousands of dollars instead of units. Because the data team didn’t know that the metric would be used by the marketing team, the information did not pass and the pipeline was fed with the wrong data.

It’s not the first time this has happened. Hours of productivity are ruined by firefighting data issues. It’s decided – you need to implement a new strategy to avoid this.

Observability is closely tied to the notion of data quality, which is often defined as a way of measuring data indicators. Data quality is one thing, but monitoring it is something else! Throughout this chapter, we will explore the principles of data quality and see how they can guide you on the data observability journey. We will also see why the information bias between stakeholders is key to understanding the need for data quality and observability in the data pipeline.

Data quality comes from the need to ensure correct and sustainable data pipelines. We will look at the different stakeholders of a data pipeline and describe why they need data quality. We will also define data quality through several concepts, which will lead to you understanding how a common base can be created between stakeholders.

By the end of this chapter, you will understand how data quality can be monitored and turned into metrics, preparing the ground for data observability.

In this chapter, we’ll cover the following topics:

Learning about the maturity path of data in companies
Identifying information bias in data
Exploring the seven dimensions of data quality
Turning data quality into SLAs
Indicators of data quality
Alerting on data quality issues

Learning about the maturity path of data in companies

The relationship between companies and data started a long time ago, at least as far back as the end of the 1980s, with the first large-scale diffusion of computers in offices. As computers and data became more widespread in subsequent years, the use of data in companies went through a long period, of at least two decades, during which investments in data grew, but only linearly. We cannot speak of a data winter, but we can consider it a long wait for the spring that led to the explosion of data investments we have experienced since the second half of the 2000s. This waiting period was ended by at least three fundamental factors:

The collapse of the cost of the resources necessary to store and process data (memory and CPUs)
The advent of IoT devices, the widespread access to the internet, and the subsequent tsunami of available data
The diffusion and accessibility of relatively simple and advanced technologies dedicated to processing large amounts of data, such as Spark, Delta Lake, NoSQL databases, Hive, and Kafka

When these three fundamental pillars became accessible, the most attentive companies embarked on a complex path in the world of data, a maturity path that is still ongoing today, with several phases, each with its challenges and problems:

Figure 1.1 – An example of the data maturity path

Each company started this path differently, but usually, the first problem to solve was managing the continuously growing volume of data coming from increasingly popular applications, such as e-commerce websites, social platforms, and games, as well as apps for mobile devices. The solution was to invest in small teams of software engineers who experimented with the use and integration of big data technologies and platforms. Among these was Hadoop with its main components, HDFS, MapReduce, and YARN, which are responsible for storing enormous volumes of data, processing them, and managing the resources, respectively, all in a distributed system. More recent technologies, such as Spark, Flink, Kafka, NoSQL databases, and Parquet, provided a further boost to this process. These software engineers were unaware that they were the first generation of a new role that is now one of the most popular and in-demand roles in software engineering – the data engineer.

These early teams have often been seen as research and development teams, and the expectations of them have grown with increasing investments. So, the next step was to ask how these teams could express their potential. Consequently, the step after that was to invest in an analytics team that could work alongside, or as a consumer of, the data engineers’ team. The natural way to start extracting value from data was with the adoption of advanced analytics and the introduction of techniques and solutions based on machine learning. Then, companies began to acquire a corporate data culture and appreciate the great potential and competitiveness that data could provide. Whether they realized it or not, they were becoming data-driven companies, or at least data-informed; in the meantime, data began to be taken seriously – as a real asset, a critical component, and not just a mysterious box from which to take some insight only when strictly necessary.

The first results and the constant growth of the available data triggered a real race that has pushed companies to invest more and more in personnel and data technologies. This has led to the proliferation of new roles (data product manager, data architect, machine learning engineer, and so on) and the explosion of data experts in the company, which led to new and unexplored organizational problems. The centralized data team model revealed all its limits in terms of scalability and the lack of readiness to support the real problems of the business. Therefore, the process of decentralizing these data experts has begun and, in addition to solving these issues, has introduced new challenges, such as the need to adopt data governance processes and methodologies. Consequently, with this decentralization, with data becoming more and more central in the company, and with the need to strengthen data quality skills, what was merely important yesterday is becoming a priority today: governing and monitoring the quality of data.

The spread of teams and data has strengthened the data culture in companies. Interactions between decentralized actors are increasingly governed by contracts that the various teams establish between them. Data is no longer seen as an unreliable, obscure object to rely on only if necessary. Each team works daily with data, and data is now a real product that must comply with quality standards that are on par with any other product generated in the company. The quality of data is of extreme importance; it is no longer one problem of many, it is the problem.

In this section, we learned about the data maturity path that many companies are facing and understood the reasons that are pushing companies to invest more and more in data quality.

In the next section, we will understand how to identify information bias in data, introduce the roles of data producers and data consumers, and cover the expectations and responsibilities of these two actors toward data quality.

Identifying information bias in data

Let’s talk about a sneaky problem in the world of data: information bias. This bias arises from a misalignment between data producers and consumers. When the expectations and understanding of data quality are not in sync, information bias manifests, distorting the data’s reliability and integrity. This section will unpack the concept of information bias in the context of data quality, exploring how discrepancies in producers’ and consumers’ perspectives can skew the data landscape. By delving into the roles, relationships, and responsibilities of these key stakeholders, we’ll shed light on the technical intricacies that underpin a successful data-driven ecosystem.

Data is a primary asset of a company’s intelligence. It allows companies to get insights, drive projects, and generate value. At the genesis of all data-driven projects, there is a business need:

Creating a sales report to evaluate top-performing employees
Evaluating the churn of young customers to optimize marketing efforts
Forecasting tire sales to avoid overstocking

These projects rely on a data pipeline, a succession of applications that manipulate raw data to create the final output, often in the form of a report:

Figure 1.2 – Example of a data pipeline

In each case, the produced data serves the interests of a consumer, which can be, among others, a manager, an analyst, or a decision-maker. In a data pipeline, the applications or processes, such as Flink or Power BI in Figure 1.2, consume and produce data sources, such as JSON files or SQL databases.

There are several stakeholders in a pipeline at each step or application: the producers on one hand and the consumers on the other hand. Let’s look at these stakeholders in detail.

Data producers

The producer creates the data and makes it available to other stakeholders. By definition, a producer is not the final user of the data. It can be a data engineering team serving the data science team, an analyst serving the board of managers, or a cross-functional team that produces data products available for the organization. In our pipeline, for instance, an engineer coding the Spark ingestion job is a producer.

As a data producer, you are responsible for the content you serve, and you are concerned about maintaining the right level of service for your consumers. Data producers also need to create more projects to fulfill a maximum amount of needs coming from various teams, so producers need to deal with maintaining quality for existing projects and delivering new projects.

As a data producer, you have to maintain a high level of service. This can be achieved by doing the following:

Defining clear data quality targets: Understand what is required to maintain high quality, and communicate those standards to all the data source stakeholders
Ensuring those targets are met thanks to a robust validation process: Put the quality targets into practice and verify the quality of the data, from extraction to transformation and delivery
Keeping accurate and up-to-date data documentation: Document how the process modified the data with instruments such as data lineage and metrics
Collaborating with the data consumers: Ensure you set the right quality standards so that you can correctly maintain them and adapt to evolving needs

We emphasize that collaboration with consumers is key to fulfilling the producer’s responsibilities. Let’s see the other end of the value chain: data consumers.

Data consumers

The consumer uses the data created by one or several producers. It can be a user, a team, or another application. They may or may not be the final user of the data, or may just be the producer of another dataset. This means that a consumer can become a producer and vice versa. Here are some examples of consumers:

Data or business analysts: They use data produced by the producers to extract insights that will support business decisions
Business managers: They use the data to make (strategic) decisions or follow indicators
Other producers: The consumer is a new intermediary in the data value chain who uses produced data to create new datasets

A consumer needs correct data, and this is where data quality enters the picture. As a consumer, you are dependent on the job done by the producers, especially because your inputs, whether they need to feed another application or a business report, depend directly on the outputs of the producers. Let’s look at the different interactions among the stakeholders.

The relationship between producers and consumers

Producers and consumers are interdependent. Consumers need the raw materials from the producers as inputs and, in turn, can create outputs that become inputs for other consumers.

Besides this, a producer can have several dependent consumers. For instance, the provider of a data lake will create data that will be the backbone of multiple projects in the data science team.

Conversely, a consumer can use various producers’ data to create their data product. Take the example of a churn model, a machine learning project that aims to identify the customers who are about to leave the contract. Those models will use the Customer Relationship Management (CRM) data from the company, but will also rely on external sources such as the International Monetary Fund (IMF) to extract the GDP per capita.

In a data pipeline, these two roles are alternating and a consumer can easily become a producer (see Figure 1.3). In well-structured data-driven companies, it is often the case where a team will be responsible for collecting data, another team will ingest the data in the master data, and a data analyst team will use it to create reports. The following figure depicts a data pipeline where consumers and producers are the stakeholders:

Figure 1.3 – A view of a pipeline as a succession of producers and consumers

As you can see, a pipeline can become complex in terms of responsibilities when several stakeholders are involved.

For a consumer, the quality of the data is key. It is about getting the right tools at the right time, as on an assembly line. If you work in a car manufacturing company, you won’t expect to receive flat tires when it’s your turn to work on the car (and it is even worse if the line is late or completely stopped). You can, of course, control the quality of the tires on your own, when you receive them from the tire manufacturer, before putting them on the car. Nevertheless, at this stage, the issue is detected too late as it will eventually slow down car production.

In these interconnected pipelines, issues may arise once the quality of the data doesn’t meet the consumers’ expectations. It can be even worse if those issues are detected too late, once the decision has already been taken, leading to disastrous business impacts. As a result, the trust of the consumers in the whole data pipeline is eroded, and the data producer becomes more hesitant to deploy new data applications to production, significantly lengthening the time to market.

In a large-scale company, even small quality issues at the beginning of the pipeline can have bad consequences on the outcome. Without a good data quality process, teams lose days and months firefighting issues. Finding the cause of an issue or even detecting the issue itself can be painful. The consumer may detect the issue and come back to the producer, asking them to fix the pipeline as soon as possible – not to say immediately. However, without good-quality processes, you may spend days analyzing complex data pipelines and asking for permission to read data from other teams.

Asymmetric information among stakeholders

While the goal of each stakeholder is clear – producers want to send the highest quality data to consumers, and consumers want the best quality standard for their data – it is not clear who is responsible for the data quality. Consumers expect data to be of good quality, and expect this quality to be ensured and backed up by the producers. Conversely, producers expect their consumers to validate and control the quality of the data they receive. This results in a misalignment of objectives and responsibilities.

This is at the root of what we describe as information bias between producers and consumers. They both have asymmetric information about data quality. This is a situation where one party has more or better information than the other, which can lead to an imbalance of power or an unfair advantage. The producer wants to deliver the quality defined by the consumer but needs to receive well-defined and accurate expectations from them.

The consumer knows the important metrics they want to follow. However, it requires good communication within data teams to ensure these parameters are understood by the producers.

There is also a shared responsibility paradigm: while the data producers bear the responsibility for ensuring quality, the consumers play an important role in providing feedback and setting clear expectations. This shared responsibility is also key to fostering a good data quality culture inside the organization.

Data quality is paramount because it’s the cornerstone of trust in any data-driven decision-making process. Just like a house needs a solid foundation to stand, decisions need reliable data to be sound. When data quality is compromised, everything built on top of it is at risk.

With that, we have explained why data quality is important and how it can reinforce relationships in a company. Now, we’ll learn what data quality is by exploring the seven dimensions of data quality.

Exploring the seven dimensions of data quality

Now that you understand why data quality is important, let’s learn how to measure the quality of datasets.

A lot has already been written about data quality in the literature – for instance, we can cite the DAMA-DMBOK (Data Management Body of Knowledge) book. In this chapter, we have decided to focus on seven dimensions: accuracy, completeness, consistency, conformity, integrity, timeliness, and uniqueness:

Figure 1.4 – Dimensions of data quality

In the following sections, we will cover these dimensions in detail and explain the potential business impacts of poor-quality data in those dimensions.

Accuracy

The accuracy of data defines its ability to reflect the real-world values it describes. In a CRM, the contract type of a customer should be correctly associated with the right customer ID. Otherwise, marketing actions could be wrongly targeted. For instance, in Figure 1.5, where the first dataset was wrongly copied into the second one, you can see that the email addresses were mixed up:

Figure 1.5 – Example of inaccurate data
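One way to picture an accuracy check is to compare a copied dataset against a source you trust. The following is a minimal sketch, assuming both datasets are lists of dictionaries keyed by a hypothetical customer_id field; the function name and the sample records are invented for illustration and do not come from Figure 1.5:

```python
def find_inaccurate_emails(reference, copy):
    """Return the customer IDs whose email differs from the trusted reference."""
    reference_by_id = {row["customer_id"]: row["email"] for row in reference}
    return [
        row["customer_id"]
        for row in copy
        if reference_by_id.get(row["customer_id"]) != row["email"]
    ]


# Hypothetical records: two emails were swapped during the copy.
reference = [
    {"customer_id": 1, "email": "ann@example.com"},
    {"customer_id": 2, "email": "bob@example.com"},
    {"customer_id": 3, "email": "cat@example.com"},
]
copy = [
    {"customer_id": 1, "email": "bob@example.com"},
    {"customer_id": 2, "email": "ann@example.com"},
    {"customer_id": 3, "email": "cat@example.com"},
]

print(find_inaccurate_emails(reference, copy))  # [1, 2]
```

Such a comparison only works when a trusted reference exists, which is often the hardest part of measuring accuracy.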

Completeness

Data is considered complete if it contains all the data needed for the consumers. It is about getting the right data for the right process. A dataset can be incomplete if it does not contain an expected column, or if missing values are present. However, if the removed column is not used by your process, you can evaluate the dataset as complete for your needs. In Figure 1.6, the page_visited column is missing, while other columns are missing values. This is very annoying for the marketing team, who are in charge of sending emails, as they cannot contact all their customers:

Figure 1.6 – Example of incomplete data

The preceding case is a clear example of where data producers’ incentives can be different from consumers’. Maybe the producer allowed empty cells to increase sales conversions, as filling in the email address may create friction. However, for a consumer using the data for an email campaign, this field is crucial.
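The two forms of incompleteness described above, a missing column and missing values, can be checked separately. Here is a minimal sketch, assuming rows are dictionaries and that the consumer supplies the list of columns it actually needs; the function name and sample rows are hypothetical:

```python
def completeness_report(rows, required_columns):
    """Return (missing columns, count of empty values per required column)."""
    present = set().union(*(row.keys() for row in rows))
    missing_columns = sorted(set(required_columns) - present)
    missing_values = {}
    for column in required_columns:
        if column in missing_columns:
            continue
        # Treat both None and the empty string as a missing value.
        count = sum(1 for row in rows if row.get(column) in (None, ""))
        if count:
            missing_values[column] = count
    return missing_columns, missing_values


# Hypothetical rows: page_visited is absent and two emails are empty.
rows = [
    {"customer_id": 1, "email": "ann@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": ""},
]

print(completeness_report(rows, ["customer_id", "email", "page_visited"]))
# (['page_visited'], {'email': 2})
```

Passing only the columns a given consumer needs reflects the point made earlier: a dataset can be complete for one process and incomplete for another.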

Consistency

The way data is represented should be consistent across the system. If you record the addresses of customers and need the ZIP code for the model, the records have to be consistent, meaning that you will not record the city for some customers and the ZIP code for others. At a technical level, the presentation of data can be inconsistent too. Look at Figure 1.7 – the has_confirmed column recorded Booleans as numbers for the first few rows and then used strings.

In this example, we can suppose the data source is a file, where fields can be easily changed. In a relational database management system (RDBMS), this issue can usually be avoided because the schema enforces the data type of each column:

Figure 1.7 – Example of inconsistent data
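A simple way to spot this kind of inconsistency in a file-based source is to count the value types found in a column. The sketch below assumes rows are dictionaries; the value_types helper and the sample rows are hypothetical and only mimic the has_confirmed situation described above:

```python
from collections import Counter


def value_types(rows, column):
    """Count how many values of each Python type appear in a column."""
    return Counter(type(row[column]).__name__ for row in rows)


# Hypothetical rows: Booleans recorded as numbers first, then as strings.
rows = [
    {"has_confirmed": 1},
    {"has_confirmed": 0},
    {"has_confirmed": "true"},
    {"has_confirmed": "false"},
]

types = value_types(rows, "has_confirmed")
is_consistent = len(types) == 1

print(dict(types))     # {'int': 2, 'str': 2}
print(is_consistent)   # False
```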

Conformity

Data should be collected in the right format. For instance, page_visited can only contain integer numbers, order_id should be a string of characters, and in another dataset, a ZIP code can be a combination of letters and numbers. In Figure 1.8, you would expect to see @ in the email address, but you can only see a username:

Figure 1.8 – Example of improper data
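Conformity rules are often expressed as patterns. The following sketch assumes a deliberately simplified email pattern (something, an @, something, a dot, something); it is an illustrative assumption, not a full validation of email addresses, and the sample rows are hypothetical:

```python
import re

# Simplified pattern for illustration only; real-world email validation
# is considerably more permissive and complex.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def nonconforming_emails(rows):
    """Return the email values that do not match the expected format."""
    return [row["email"] for row in rows if not EMAIL_PATTERN.match(row["email"])]


# Hypothetical rows: the second record only contains a username.
rows = [
    {"email": "ann@example.com"},
    {"email": "bob"},
]

print(nonconforming_emails(rows))  # ['bob']
```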

Integrity

Data transformation should ensure that data items keep the same relationship. Integrity means that data is connected correctly and you don’t have standalone data – for instance, an address not connected to any customer name. Integrity ensures the data in the dataset can be traced and connected to other data.

In Figure 1.9, an engineer has extracted the duration column, which is useless without order_id:

Figure 1.9 – Example of an integrity issue
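An integrity check can be sketched as a search for orphan records, that is, records that cannot be linked back to a known entity. The example below assumes a set of known order IDs and a list of duration records; all names and values are hypothetical:

```python
def orphan_records(records, key, known_keys):
    """Return the records whose key is missing or does not match a known key."""
    return [record for record in records if record.get(key) not in known_keys]


# Hypothetical data: one duration lost its order_id, another points nowhere.
known_order_ids = {"A1", "A2"}
durations = [
    {"order_id": "A1", "duration": 12},
    {"duration": 7},
    {"order_id": "Z9", "duration": 3},
]

print(orphan_records(durations, "order_id", known_order_ids))
# [{'duration': 7}, {'order_id': 'Z9', 'duration': 3}]
```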

Timeliness

Time is also an important dimension of data quality. If you run a weekly report, you want to be sure that the data is up to date and that the process runs at the correct time. For instance, if a weekly sales report is created, you expect to receive a report on last week’s sales. If you receive outdated data, because the database was not updated with the new week’s data, you may see that the total of this week’s report is the same as last week’s, which will lead to wrong assumptions.

Time-sensitive data, if not delivered on time, can lead to inaccurate insights, misinformed decisions, and, ultimately, monetary losses.
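Timeliness is commonly approximated by a freshness check: how long ago was the data last updated, and is that within the delay the consumer tolerates? The sketch below assumes the last update time is known and that a weekly report tolerates data up to 7 days old; both the threshold and the dates are hypothetical:

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_updated, now, max_age):
    """Return True when the data was refreshed within the allowed delay."""
    return now - last_updated <= max_age


# Hypothetical values: a weekly report that tolerates data up to 7 days old.
now = datetime(2024, 1, 15, 8, 0, tzinfo=timezone.utc)
max_age = timedelta(days=7)
refreshed_yesterday = datetime(2024, 1, 14, 23, 0, tzinfo=timezone.utc)
refreshed_two_weeks_ago = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)

print(is_fresh(refreshed_yesterday, now, max_age))      # True
print(is_fresh(refreshed_two_weeks_ago, now, max_age))  # False
```

Passing the current time as an argument, rather than reading the clock inside the function, keeps the check easy to test.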

Uniqueness

To ensure there is no ambiguous data, the uniqueness dimension is used. A data record should not be duplicated and should not contain overlaps. In Figure 1.10, you can see that the same order ID was used for two distinct orders, and an order was recorded twice, assuming that there is no primary key defined in the dataset. This kind of data discrepancy can lead to various issues, such as incorrect order tracking, inaccurate inventory management, and potentially negative customer experiences:

Figure 1.10 – Example of duplicated data
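The two situations described above, a reused order ID and an order recorded twice, can be detected with two separate counts. This is a minimal sketch assuming rows are dictionaries with hashable values; the helper names and the sample orders are hypothetical:

```python
from collections import Counter


def duplicated_keys(rows, key):
    """Return the key values that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)


def exact_duplicates(rows):
    """Return the rows that appear more than once with identical content."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(items) for items, count in counts.items() if count > 1]


# Hypothetical orders: A1 is reused for two distinct orders, B2 is recorded twice.
orders = [
    {"order_id": "A1", "amount": 10},
    {"order_id": "A1", "amount": 25},
    {"order_id": "B2", "amount": 5},
    {"order_id": "B2", "amount": 5},
]

print(duplicated_keys(orders, "order_id"))  # ['A1', 'B2']
print(exact_duplicates(orders))             # [{'amount': 5, 'order_id': 'B2'}]
```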

Consequences of data quality issues

Data quality issues can have disastrous effects. When the quality standard is not met, the consumer can face various consequences.

Sometimes, a report won’t be created because the data quality issue will result in a pipeline issue, leading to delays in the decision-making process. Other times, the issue can be more subtle. For instance, because of a completeness issue, the marketing team could send emails with "Hello {firstname}", destroying their professional appearance to the customers. The result can damage the company’s profit or reputation.

Nevertheless, it is important to note that not all data quality issues will lead to a catastrophic outcome. Indeed, an issue only has an impact if the affected data item is used in your pipeline. The consumer won’t experience issues with data they don’t need. However, this means that to ensure the quality of the pipeline, the producer needs to know what is done with the data, and what is considered important for the data consumer. A data source that’s important for your project may have no relevance for another project.

This is why in the first stages of the project, a new approach must be implemented. The producer and the consumer have to come together to define the quality expectations of the pipeline. In the next section, we will see how this can be implemented with service-level agreements (SLAs) and service-level objectives (SLOs).

Turning data quality into SLAs

Trust is an important element for the consumer. The producer is not always fully aware of the data they are handling. Indeed, the producer can be seen as an executor creating an application on behalf of the consumer. They are not a domain expert and work in a black-box environment: on the one hand, the producer doesn’t exactly understand the objectives of their work, while on the other hand, the consumer doesn’t have control over the supply chain.

This can involve psychological barriers. The consumer and the producer may have never met. For instance, the head of sales, who is waiting for their weekly sales report, doesn’t know the data engineer in charge of the Spark ingestion in the data lake. The level of trust between the different stakeholders can be low from the beginning, and the relationship can become poisonous. This kind of trust issue is difficult to solve, and confidence is extremely hard to restore. This triggers a vicious circle that can have serious effects on the company.

As stated in the Identifying information bias in data section, there is a fundamental asymmetry of information and responsibilities between producers and consumers. To solve this, data quality must be applied as an SLA between the parties. Let’s delve into this key component of data quality.

An agreement as a starting point

An SLA serves as a contract between the producer and the consumer, establishing the expected data quality level of a data asset. Concretely, with an SLA, the data producer is committed to offering the level of quality the consumer expects from the data source. The goals of the SLA are manifold:

Firstly, it ensures awareness of the producer on the expected level of quality they must deliver.
Secondly, the contractual engagement asks the producers to assume their responsibilities regarding the quality of the delivered processes.
Thirdly, it enhances the trust of the consumer in the outcomes of the pipeline, easing the burden of cumbersome double checks. The consumer puts their confidence in the contract.

In essence, data quality can be viewed as an SLA between the data producer and the consumer of the data. The SLA is not something you measure by nature, but this will drive the team’s ability to define proper objectives, as we will see in the next section.

The incumbent responsibilities of producers

To support those SLAs, the producer must establish SLOs. Those objectives are targets the producer sets to meet the requirements of the SLA.

However, it is important to stress that those quality targets can be different for each contract. Let’s imagine a department store conducting several data science projects. The marketing team oversees two data products based on the central data lake of all the store sales. The first one is a machine learning model that forecasts the sales of the latest children’s toy, while the second one is a report of all cosmetic products. The number of kids in the household may be an important feature for the first model, while it won’t be of any importance for the second report. The SLAs that are linked to the first project are different from the SLAs of the second one. However, the producer is responsible for providing both project teams with a good set of data. This said, the producer has to summarize all agreements to establish SLOs that will fulfill all their commitments.

This leads to the notion of vertical SLAs versus transversal SLOs. The SLA is considered vertical as it involves a one-to-one relationship between the producer and the consumer. In contrast, the SLO is transversal as, for a specific dataset, an SLO will be used to fulfill one or several SLAs. Figure 1.11 illustrates these principles. The producer is engaged in a contractual relationship with the consumer, while the SLO directs the producers’ monitoring plan. We can say that a data producer, for a specific dataset, has a unique set of SLOs that fit all the SLAs:

Figure 1.11 – SLAs and SLOs in a data pipeline

The agreement will concern a bilateral relationship between a producer and their consumer. An easy example is as follows: “To run my churn prediction model, I need all the transactions performed by 18+ people in the last 2 weeks.” In this configuration, the data producer knows that the data of the CRM must contain all transactions from the last 2 weeks of people aged 18 and over. They can now create the following objectives:

The data needs to contain transactions from the last 2 weeks
The data needs to contain transactions of customers who are 18 and over

We’ll see how indicators can be set to validate the SLO later.

The SLA is a contract between both parties, whereas the SLO is set by the producer to ensure they can respect all SLAs. This means that an SLO can serve several agreements. If another project requires the transaction data of 18+ customers for 3 weeks, the data team could easily adjust its target and set the limit to 3 weeks, which covers both agreements.
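The way one SLO can serve several SLAs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Sla class, its fields, the consumer names, and the derive_slo helper are hypothetical, and the idea is simply to keep the most demanding value of each requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sla:
    """One vertical agreement between the producer and a single consumer."""
    consumer: str        # hypothetical consumer name
    history_weeks: int   # weeks of transaction history the consumer needs
    min_age: int         # youngest customer age the consumer needs


def derive_slo(slas):
    """Return the transversal objective that satisfies every SLA at once."""
    return {
        # The longest requested history covers all shorter requests.
        "history_weeks": max(s.history_weeks for s in slas),
        # The lowest requested age bound covers all stricter requests.
        "min_age": min(s.min_age for s in slas),
    }


slas = [
    Sla("churn_model", history_weeks=2, min_age=18),
    Sla("second_project", history_weeks=3, min_age=18),
]
slo = derive_slo(slas)
```

With the two agreements from the text, the derived objective asks for 3 weeks of history for customers aged 18 and over.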

The advantage of thinking of data quality as a service is that, by knowing which dimension of quality your consumers need, you can provide the right adjusted level of service to achieve the objectives. As we’ll see later, it will help a lot regarding costs and resource optimization.

Considerations for SLOs and SLAs

To ensure SLAs and SLOs remain effective and relevant, it is important to keep a few considerations in mind.

First, as business requirements change over time, it is essential for both parties to regularly review the existing agreements and adjust objectives accordingly. Collaboration and communication are key to ensuring the objectives are always well aligned with consumer expectations.

Documentation and traceability are also important when it comes to agreements and objectives. By documenting those, all parties can maintain transparency and keep track of changes over time. This can be helpful in case of misalignment or misunderstandings.

Also, to encourage producers to respect their agreements, incentives and penalties can be activated, depending on whether the agreements are met.

What is important and at the core of the SLA/SLO framework is being able to measure the performance against the established SLOs. It is crucial to set up metrics that will assess the data producer’s performance.

To be implemented, objectives have to be aligned with indicators. In the next section, we will see which indicators are important for measuring data quality.

Indicators of data quality

Once objectives have been defined, we need a way to assess their validity and follow up on the quality of data on a day-to-day basis.

To do so, indicators are must-haves. Indeed, an SLO will be useless without any indicator. Those indicators can also be called service-level indicators (SLIs). An indicator is a defined measure of data quality. An indicator can be a metric, a state, or a key performance indicator (KPI). At this stage, data quality is activated and becomes data quality monitoring. The goal of the indicator is to assess the validity of your objectives. It is the producer’s responsibility to check whether an indicator behaves well.

Depending on the objective, many indicators can be put in place. If a data application expects JSON as input, the format of the incoming data source becomes an important indicator. We will see techniques and methods for gathering these indicators in Chapter 3 and learn how a data model can be used to collect those elements in Chapter 4.

Data source metadata

We define metadata as data on data. It can be the file location, the format, or even the owner of the dataset. These can be pertinent indicators for executing data applications. If a program expects a CSV file as input but the format has changed upstream to JSON, the objectives can be jeopardized.
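As a minimal sketch, a format indicator can be derived from the file location and compared with what the application expects. The EXPECTED_FORMAT constant, the helper names, and the example locations are hypothetical, and the file extension is used here only as a proxy for the real format.

```python
from pathlib import PurePosixPath

EXPECTED_FORMAT = "csv"  # hypothetical expectation of the consuming program


def observed_format(location: str) -> str:
    """Read the format indicator from the file extension of the location."""
    return PurePosixPath(location).suffix.lstrip(".").lower()


def format_matches(location: str, expected: str = EXPECTED_FORMAT) -> bool:
    """True when the observed format is the one the application expects."""
    return observed_format(location) == expected


ok = format_matches("s3://lake/sales/2024-01-01.csv")
changed = format_matches("s3://lake/sales/2024-01-02.json")
```

In a real pipeline, the format would more reliably come from a catalog or from inspecting the file content rather than from its name.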

Schema

The expected schema of a data source is an important indicator. An application or a pipeline can simply break if the input schema is not correct. The important characteristics of a schema are its field names and field types.

A wrong type, or a deleted column, can lead to missing values in the outcome, or broken applications. Sometimes, the issue is even sneakier. A Python AI model may only require a feature matrix without column names. If two columns are interchanged, the model will apply its coefficients to the wrong values and produce misleading results.
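The following sketch compares an observed schema with the expected one and reports missing columns, changed types, and reordered columns. The column names, the type labels, and the schema_issues helper are assumptions made for the example.

```python
# Hypothetical expected schema: ordered (field name, field type) pairs.
EXPECTED_SCHEMA = [("customer_id", "int"), ("age", "int"), ("amount", "float")]


def schema_issues(expected, observed):
    """List the differences between the expected and the observed schema."""
    issues = []
    expected_types = dict(expected)
    observed_types = dict(observed)
    for name, dtype in expected:
        if name not in observed_types:
            issues.append(f"missing column: {name}")
        elif observed_types[name] != dtype:
            issues.append(f"type changed: {name} {dtype} -> {observed_types[name]}")
    # Compare the relative order of the columns both schemas share.
    expected_order = [n for n, _ in expected if n in observed_types]
    observed_order = [n for n, _ in observed if n in expected_types]
    if expected_order != observed_order:
        issues.append("column order changed")
    return issues


swapped = [("age", "int"), ("customer_id", "int"), ("amount", "float")]
swap_report = schema_issues(EXPECTED_SCHEMA, swapped)
```

The order check is the one that would catch the interchanged columns described above, since names and types alone would still look valid.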

Lineage

Lineage, or data mapping at the application level, allows you to express the transformations of data you need in the process. This lineage gives an overview of how each data field is used. Monitoring this lineage helps you catch unexpected changes in the code base of the application.

Data lineage describes the data flow inside the application. It is the best documentation about what’s happening inside each step of a pipeline. This lineage allows you to create a data usage catalog, where you can see who accessed the data, how it was used, and what was created out of it.

Thanks to lineage, you can manage the risk of modifying the application over time as you can easily understand which other applications, data sources, and users will be impacted.

Lineage is also a good indicator if you wish to know whether what’s happening inside the application is permitted. Imagine a data pipeline in a bank that grants loans to its customers. The project team may choose to rely on the automatic decision of an AI model. For ethical and GDPR reasons, you do not want any of these decisions to rely on the gender of the customer. However, the feature matrix can be a mix of scores computed from other data sources. Thanks to lineage, you can validate how the feature matrix was created, and avoid any misuse of gender data you have in the data lake.
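To illustrate the loan example, field-level lineage can be represented as a mapping from each field to the fields it was computed from, and then walked upstream. The LINEAGE mapping, its field names, and the upstream_fields helper are hypothetical; a real lineage graph would be collected from the applications themselves.

```python
# Hypothetical field-level lineage: each field maps to its direct inputs.
LINEAGE = {
    "feature_matrix.risk_score": ["crm.income", "scores.behaviour"],
    "scores.behaviour": ["crm.purchases", "crm.gender"],
    "feature_matrix.tenure": ["crm.signup_date"],
}


def upstream_fields(field, lineage):
    """Return every field that directly or indirectly feeds the given field."""
    seen = set()
    stack = [field]
    while stack:
        current = stack.pop()
        for parent in lineage.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


risk_inputs = upstream_fields("feature_matrix.risk_score", LINEAGE)
uses_gender = "crm.gender" in risk_inputs
```

Here the risk score depends on gender only indirectly, through an intermediate score, which is exactly the kind of usage that is hard to spot without lineage.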

Application

Information about the application itself is an often neglected but important indicator. The code version (or tool version), as well as the timestamp of execution, can be valuable for detecting data quality issues. An application is a tool, a notebook, a script, or a piece of code that modifies the data. It is fueled by inputs and produces outputs.

An application running a wrong code version may use outdated data, leading to a timeliness issue. This is a perfect example of indicators that will serve a technical objective without a direct link to the agreement. The code version is often out of the scope of consideration for the business team, while it can have a lot of impact on the outcome.

Statistics and KPIs

Broadly speaking, there are many metrics – generic or not – that can become indicators in the context of data quality. Here, we draw a distinction between statistics and KPIs.

Statistics are a list of predefined metrics that you can compute on a dataset. KPIs are custom metrics that are often related to the accuracy of the dataset, which can also relate to a combination of datasets. Let’s deep dive into some of the main statistics as indicators:

Distribution: There are many ways to compute the distribution of a dataset feature. For numerical data, the minimum, maximum, mean, median, and other quantiles can be very good indicators. If the machine learning model is very sensitive to the distribution, skewness and kurtosis can also be considered to offer a better view of the shape of the data. For categorical data, the mode and frequency are valuable indicators.
Freshness: Several time-based metrics determine whether a dataset is fresh. This is done by using a combination of the frequency and the timeliness of the dataset:
- Frequency: This metric tells us at what time the dataset was updated or used and allows us to check whether the data was solicited on time. If the dataset is expected to be updated every day at noon and is late, data processes may be wrong if they’re launched before the availability of new data rows.
- Timeliness: This is a measure of the obsolescence of data. It can be computed by checking the timestamp of the data and the timestamp of the process using it. A reporting process could show no errors while the underlying data is outdated. If you want to be sure that the data you are using is last week’s data, you can use a timeliness indicator.
Completeness: Two indicators can help you check whether data is complete or not. The first one is the volume of data, while the second one is the missing values indicator:
- Volume: This is an indicator of the number of rows being processed in the dataset. A sharp drop or increase in the volume of data can signal an issue upstream or a change in the data collection process. If you expect to target 10 million customers but the query returns only 5,000, you may be facing a volume issue.
- Missing values: This is an indicator of the number or the percentage of null rows or null values in the dataset. A small percentage of missing values can be tolerated by the business team. However, for technical reasons, some machine learning models do not allow any missing values and will then return errors when running.
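The statistics above can be computed with Python’s standard library. The sketch below is illustrative only: the function names and the sample values are assumptions, and a production pipeline would more likely compute these on the full dataset with its processing engine.

```python
from datetime import datetime
from statistics import mean, median


def distribution(values):
    """Distribution indicators for a numerical column, ignoring nulls."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }


def missing_ratio(values):
    """Completeness: share of null values in the column."""
    return sum(v is None for v in values) / len(values)


def timeliness_days(latest_record, run_time):
    """Freshness: age of the newest record, in whole days, at run time."""
    return (run_time - latest_record).days


ages = [34, 51, None, 27, 42]   # hypothetical sample column
stats = distribution(ages)
volume = len(ages)              # completeness: number of rows processed
null_share = missing_ratio(ages)
lag = timeliness_days(datetime(2024, 1, 1), datetime(2024, 1, 9))
```

Each value is only a measurement at this point. Whether a lag of 8 days or a 20% share of nulls is acceptable depends on the objective it supports.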

KPIs are custom metrics that can give an overview of the specific needs of the consumers. They can be business-centric or technical-driven. These KPIs are tailored to work with the expectations of the consumers and to ensure the robustness of the pipeline.

If the objective is to provide accurate data for the quantity sold, you can set an indicator on the quantity values, for example a Boolean value that equals 1 if all the values are positive. This business-driven KPI helps guarantee the accuracy of the data item.
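A minimal sketch of such a Boolean KPI follows. The function name is hypothetical, and treating an empty column as valid is a choice made for the example, not a recommendation.

```python
def all_positive_kpi(quantities):
    """Return 1 if every quantity sold is strictly positive, else 0.

    An empty column yields 1 here, because there is no offending value.
    """
    return int(all(q > 0 for q in quantities))


valid = all_positive_kpi([3, 1, 7])
invalid = all_positive_kpi([3, -1, 7])
```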

A technical KPI can be, for instance, a difference in the number of rows of the inputs and outputs. To ensure the data is complete, the result of a full join operation should ensure that you end up with at least the same number of rows as the input data sources. If the difference is negative, you can suspect an issue in the completeness of the data.
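The row-count KPI can be sketched as follows, under two assumptions that are not stated in the text: the join keys are unique within each input, and the output is compared with the largest input. The function name and the sample keys are hypothetical.

```python
def row_difference_kpi(left_keys, right_keys, output_keys):
    """Output row count minus the row count of the largest input.

    Assumes unique join keys in each input. A negative value suggests
    that rows were lost somewhere in the full join.
    """
    largest_input = max(len(set(left_keys)), len(set(right_keys)))
    return len(set(output_keys)) - largest_input


left = [1, 2, 3]
right = [3, 4]
healthy = row_difference_kpi(left, right, set(left) | set(right))
lossy = row_difference_kpi(left, right, [1, 2])
```

A non-negative result does not prove the join is correct, but a negative one is a cheap signal that completeness deserves a closer look.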

A KPI can also be used to assess other dimensions of data quality. For instance, consistency is a dimension that often requires us to compare several data source items across the enterprise filesystem. If, in a Parquet partition, the amount column has three decimals but only one in another partition, a KPI can be computed and provide a score that can be 0 if the format of the data is not consistent.
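The decimal consistency check described above could look like the sketch below. It assumes the amounts are available as strings so that trailing zeros are preserved, and the partition names and helper functions are hypothetical.

```python
from decimal import Decimal


def decimal_places(value: str) -> int:
    """Number of digits after the decimal point in a numeric string."""
    exponent = Decimal(value).as_tuple().exponent
    return max(0, -exponent)


def consistency_kpi(partitions):
    """Return 1 if every partition uses the same number of decimals, else 0.

    Each partition must contain at least one value.
    """
    scales = {
        max(decimal_places(v) for v in values)
        for values in partitions.values()
    }
    return 1 if len(scales) == 1 else 0


inconsistent = consistency_kpi({
    "2024-01": ["12.500", "3.250"],   # three decimals
    "2024-02": ["12.5", "3.2"],       # one decimal
})
consistent = consistency_kpi({
    "2024-01": ["12.500", "3.250"],
    "2024-02": ["7.125"],
})
```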

Examples of SLAs, SLOs, and SLIs

Going back to our previous example, we defined the SLOs as follows:

The data needs to contain transactions from the last 2 weeks
The data needs to contain transactions of customers who are 18 and over

To define indicators on those SLOs, we can start by analyzing the dimensions we need to cover. In this example, the SLOs are related to the completeness, timeliness, and relevance of the data. Therefore, this is how you can set up SLIs:

Firstly, there is a need to send data promptly. A timeliness indicator can be introduced, making sure you use the data from the two latest weeks at the time of execution. You have to measure the difference between the latest data point and the current date. If the difference is within 14 days, you meet the SLI.
Secondly, to make sure the data is complete, you can compute the percentage of missing values of the dataset you create. A threshold can be added for the acceptable percentage of missing values, if any.
Thirdly, a KPI can be added to ensure the data contains only 18+ consumers. For instance, it can be a volume indicator counting the rows where the age column is lower than 18. If this indicator remains 0, you meet the SLI.
Indicators can also be put on the schema to assure the consumer that the transaction amount is well defined and always a float number.
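Put together, the SLIs for this example can be sketched as a single evaluation function. Everything named here is an assumption made for illustration: the row layout, the column names, the evaluate_slis helper, and the default threshold of zero missing values. The 18+ check counts rows whose age is below 18 and expects that count to be zero.

```python
from datetime import date


def evaluate_slis(rows, today, max_missing_ratio=0.0):
    """Evaluate the four example SLIs on a list of transaction rows.

    Assumes every row has a non-null "date" value.
    """
    latest = max(r["date"] for r in rows)
    cells = [value for r in rows for value in r.values()]
    missing_ratio = sum(v is None for v in cells) / len(cells)
    minors = sum(1 for r in rows if r["age"] is not None and r["age"] < 18)
    return {
        "timeliness": (today - latest).days <= 14,
        "completeness": missing_ratio <= max_missing_ratio,
        "adults_only": minors == 0,
        "amount_is_float": all(isinstance(r["amount"], float) for r in rows),
    }


rows = [
    {"date": date(2024, 3, 10), "age": 34, "amount": 19.99},
    {"date": date(2024, 3, 12), "age": 17, "amount": 5.0},
]
report = evaluate_slis(rows, today=date(2024, 3, 15))
```

In this sample, the data is fresh, complete, and correctly typed, but the second row belongs to a 17-year-old customer, so the adults_only indicator fails.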

Now that you have defined proper indicators, it is high time you learn how to activate them. This is the essence of monitoring: being alerted about discovered issues. Configuring alerts to notify the relevant stakeholders whenever an SLI fails will help you detect and address issues promptly. Let’s see how this can be done with SLIs.

Alerting on data quality issues

Once indicators have been defined, it is important to set up systems to control and assess the quality of the data as per these indicators. An easy way to do so is by testing the quality with rules.

An indicator reflects the state of the system and is a proxy for one or several dimensions of data quality. Rules can be set to create a link between the indicators and the objective(s). For a producer, violating these rules is equivalent to a failed objective. Incorporating an alerting system aims to place the responsibility of detecting data quality issues in the producer’s hands.

An indicator can be the fuel of several rules. Indicators can also be used over time to create a rule system involving variations.

Using indicators to create rules

Collecting indicators is the first step toward monitoring. After that, to prevent data quality issues, you need to understand the normality of your data.

Indicators reflect the current state of the dataset. An indicator is a way of measuring data quality but does not assess the quality per se. To do so, you need to understand whether the indicator is valid or not.

Moreover, data quality indicators can also be used to prevent further issues and define other objectives linked to the agreement. Using lineage, you can define whether modifying indicators upstream in the data flow can have an impact on the SLA you want to support.

Rules can be established based on one indicator, several indicators, or even the observation of an indicator over time.

Rules using standalone indicators

A single indicator can be the source of detection of a major issue. An easy way to create a rule on an indicator is to set up an acceptable range for the indicator. If the data item has to represent an age, you can set up rules on the distribution indicators of minimum and maximum. You probably don’t want the minimum age to be a negative number and you don’t want the maximum to be exaggerated (let’s say more than 115 years old).
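A range rule on the minimum and maximum age indicators can be as small as the following sketch. The function name is hypothetical, and the bounds simply reuse the values from the example above.

```python
def age_range_rule(min_age, max_age, lower=0, upper=115):
    """Pass when the observed minimum and maximum ages stay within bounds."""
    return min_age >= lower and max_age <= upper


passes = age_range_rule(min_age=18, max_age=92)
negative_age = age_range_rule(min_age=-3, max_age=92)
too_old = age_range_rule(min_age=18, max_age=140)
```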

Rules using a combination of indicators

Several indicators can be used in a single rule to support an objective. The missing value indicator in a column can be influenced by the missing values of other datasets. In a CRM, the Full Name column can be a combination of the First Name and Surname columns. If one row of the First Name column is empty, there is a high probability that Full Name will also fail. In that case, setting a rule on the First Name column also ensures that the completeness objective of Full Name is fulfilled.
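One possible way to express this as a rule is to check the missing value indicators of both source columns together. The column keys, the function name, and the 3% threshold are assumptions made for this sketch.

```python
def full_name_completeness_rule(missing, threshold=0.03):
    """Pass when both source columns are complete enough for Full Name.

    `missing` maps a column name to its ratio of missing values.
    """
    return max(missing["first_name"], missing["surname"]) <= threshold


healthy = full_name_completeness_rule({"first_name": 0.01, "surname": 0.02})
at_risk = full_name_completeness_rule({"first_name": 0.10, "surname": 0.00})
```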

Rules based on a time series of indicators

The variation of an indicator over time should also be taken into consideration. This can be valuable in the completeness dimension, for instance. The volume of data, which is the number of rows you process in the application, can vary over months. This variation can be monitored, alerting the producer if there is a drop of more than 20% in the number of rows.
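A rule on the variation of the volume indicator can be sketched by comparing the last two observations. The function name and the sample row counts are hypothetical, and comparing only consecutive runs is a simplification: a moving average or a seasonal baseline may suit real data better.

```python
def volume_drop_alert(history, max_drop=0.20):
    """Alert when the latest row count fell by more than max_drop.

    `history` is a chronological list of row counts with at least two entries.
    """
    previous, current = history[-2], history[-1]
    if previous == 0:
        return False  # no baseline to compare against
    drop = (previous - current) / previous
    return drop > max_drop


alert = volume_drop_alert([980, 1000, 750])      # 25% drop
no_alert = volume_drop_alert([980, 1000, 900])   # 10% drop
```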

Rules should be the starting point of alerts. In turn, these alerts can be used to detect but also prevent any issue. When a rule detects an issue, it helps to ensure a trust relationship with the consumer as they will be able to assess the quality of the data before using it.

The data scorecard

With the rules associated with the data source, indicators and rules can be used to create a non-subjective data scorecard. This scorecard is an easy way for the business team to assess the quality of the data comprehensively. The scorecard can help the consumer share their issues with the producers, avoiding the traditional “My data is broken, please fix it!” issue. Instead, the consumer can stress the reason for the failing job: “I’ve noticed a drop in the quality of my dataset: the percentage of null rows in the Age column exceeds 3%.” It also helps the consumer understand the magnitude of the problem, and the producer to prioritize their work. You won’t react the same way to 2% of null rows as to a number of missing values that has jumped by more than 300%.

The primary advantage of a scorecard is that it aims to increase the trustworthiness of the dataset. Even if the score is not the best one today, the user is reassured and knows that an issue will be detected by the producer itself. As a result, the latter gains in reputation. Creating a scorecard for the datasets you produce demonstrates your data maturity. It promotes a culture of continuous improvement within the team and organization.

Also, a scorecard helps in assessing data quality issues. By assigning weights to different dimensions of data quality, the scorecard allows you to prioritize aspects of data quality to ensure the most critical dimensions get the necessary focus.

This scorecard can be created per data usage, which means that a score can be associated with each SLA. We suggest the scorecard is a mirror of the data quality dimensions. Let’s look at some techniques for creating such a scorecard.

Creating a scorecard – the naïve way

To start with an easy implementation of the scorecard, you can use a percentage of the number of rules met (rules not being broken) over the total number of rules. This gives you a number between 0 and 100 and tells you how the data source behaved when it was last assessed.
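The naïve score fits in a single function. The rule names below are hypothetical, and each rule result is assumed to be a Boolean where True means the rule was met.

```python
def naive_score(rule_results):
    """Percentage of rules met over the total number of rules."""
    return 100 * sum(rule_results.values()) / len(rule_results)


score = naive_score({
    "age_in_range": True,
    "amount_is_float": True,
    "null_ratio_below_3_percent": False,
    "fresh_within_14_days": True,
})
```

Three rules out of four are met here, which gives a score of 75.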

Creating a scorecard – the extensive way

This scorecard is created based on one or several dimensions of data quality. For each dimension and each SLA, you can compute the score of the rules used to test those dimensions. To do so, follow these steps:

Identify the data quality dimensions relevant to the data source.
Assign weight, or importance, to each quality dimension based on its importance to the business objectives.
Define rules for each dimension you want to cover while considering the requirements of the SLAs.
Compute a score for each rule. You can use the naïve approach of counting the number of respected rules or you can use a more sophisticated approach.
Compute the weighted score by multiplying the score of each rule by the weight of its corresponding dimension and summing all the results.
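The steps above can be sketched as follows. This is one possible variant rather than a reference implementation: it scores each dimension with the naïve share of rules met, normalizes the weights so the result stays between 0 and 100, and assumes every weighted dimension has at least one rule. The dimension names and weights are hypothetical.

```python
def weighted_score(weights, rule_results):
    """Weighted scorecard between 0 and 100.

    `weights` maps a dimension to its importance.
    `rule_results` maps a dimension to a list of Booleans, one per rule.
    """
    total_weight = sum(weights.values())
    score = 0.0
    for dimension, results in rule_results.items():
        # Naive per-dimension score: share of rules met.
        dimension_score = sum(results) / len(results)
        score += weights[dimension] / total_weight * dimension_score
    return 100 * score


weights = {"completeness": 5, "timeliness": 3, "accuracy": 2}
rule_results = {
    "completeness": [True, True],
    "timeliness": [True, False],
    "accuracy": [True],
}
scorecard = weighted_score(weights, rule_results)
```

In this example, the single broken timeliness rule costs 15 points, leaving a score of about 85. The same broken rule would cost more in a dimension that carries a higher weight.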

By visualizing and tracking the created scores, you can easily share them with your stakeholders, compare data sources with each other, and detect trends and patterns.

Let’s summarize what we’ve learned in this chapter.
