Data Observability for Data Engineering

You're reading from Data Observability for Data Engineering: Proactive strategies for ensuring data accuracy and addressing broken data pipelines

Product type: Paperback
Published in: Dec 2023
Publisher: Packt
ISBN-13: 9781804616024
Length: 228 pages
Edition: 1st Edition
Authors (2): Michele Pinto, Sammy El Khammal

Table of Contents (17)

Preface
Part 1: Introduction to Data Observability
Chapter 1: Fundamentals of Data Quality Monitoring
Chapter 2: Fundamentals of Data Observability
Part 2: Implementing Data Observability
Chapter 3: Data Observability Techniques
Chapter 4: Data Observability Elements
Chapter 5: Defining Rules on Indicators
Part 3: How to Adopt Data Observability in Your Organization
Chapter 6: Root Cause Analysis
Chapter 7: Optimizing Data Pipelines
Chapter 8: Organizing Data Teams and Measuring the Success of Data Observability
Part 4: Appendix
Chapter 9: Data Observability Checklist
Chapter 10: Pathway to Data Observability
Index
Other Books You May Enjoy

Learning about the maturity path of data in companies

The relationship between companies and data goes back a long way, at least to the end of the 1980s, when computers first spread widely through offices. As computers and data became more pervasive in the following years, the use of data in companies went through a long period, lasting at least two decades, during which investments in data grew, but only linearly. We cannot speak of a data winter, but we can see it as a long wait for the spring that led to the explosion of data investments we have experienced since the second half of the 2000s. This long wait was ended by at least three fundamental factors:

  • The collapse of the cost of the resources needed to store and process data (storage and CPUs)
  • The advent of IoT devices, widespread access to the internet, and the subsequent tsunami of available data
  • The diffusion and accessibility of relatively simple yet powerful technologies dedicated to processing large amounts of data, such as Spark, Delta Lake, NoSQL databases, Hive, and Kafka

When these three fundamental pillars became accessible, the most attentive companies embarked on a complex journey into the world of data, a maturity path that is still ongoing today, with several phases, each with its own challenges and problems:

Figure 1.1 – An example of the data maturity path

Each company started this path differently, but usually, the first problem to solve was managing the continuously growing volume of data coming from increasingly popular applications, such as e-commerce websites, social platforms, games, and mobile apps. The typical solution was to invest in small teams of software engineers who experimented with big data technologies and platforms, chief among them Hadoop and its main components: HDFS, MapReduce, and YARN, which are responsible for storing enormous volumes of data, processing it, and managing cluster resources, respectively, all in a distributed system. More recent technologies, such as Spark, Flink, Kafka, NoSQL databases, and Parquet, gave this process a further boost. These software engineers were unaware that they were the first generation of a new role that is now one of the most popular and in-demand roles in software engineering: the data engineer.
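
To make this concrete, here is a minimal sketch, not from the book, of the kind of distributed job those early teams ran: the canonical word count, expressed here with PySpark rather than raw MapReduce. The HDFS paths are hypothetical.

    # A word count over a distributed file system, the classic "hello world"
    # of big data. Input and output paths are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

    lines = spark.sparkContext.textFile("hdfs:///data/raw/events.txt")
    counts = (
        lines.flatMap(lambda line: line.split())  # map: one record per word
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)          # reduce: sum counts per word
    )
    counts.saveAsTextFile("hdfs:///data/out/word_counts")

    spark.stop()

The division of labor the paragraph describes is visible even in this toy job: the distributed file system holds the data, the map and reduce steps process it, and the cluster's resource manager schedules the work.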

These early teams were often seen as research and development teams, and expectations of them grew with increasing investment. So, the next step was to ask how these teams could express their potential. Consequently, the step after that was to invest in an analytics team that could work alongside, or as a consumer of, the data engineering team. The natural way to start extracting value from data was to adopt advanced analytics and introduce techniques and solutions based on machine learning. Companies then began to build a corporate data culture and appreciate the great potential and competitiveness that data could provide. Whether they realized it or not, they were becoming data-driven companies, or at least data-informed ones; meanwhile, data began to be taken seriously, as a real asset and a critical component, not just a mysterious box from which to extract some insight only when strictly necessary.

The first results and the constant growth of available data triggered a real race that pushed companies to invest more and more in personnel and data technologies. This led to the proliferation of new roles (data product manager, data architect, machine learning engineer, and so on) and an explosion of data experts in the company, which in turn created new and unexplored organizational problems. The centralized data team model revealed its limits in terms of scalability and its lack of readiness to support the real problems of the business. Therefore, a process of decentralizing these data experts began; besides solving those issues, it introduced new challenges, such as the need to adopt data governance processes and methodologies. Consequently, with this decentralization, with data ever more central to the company, and with the need to raise data quality skills, what was merely important yesterday is becoming a priority today: governing and monitoring the quality of data.

The spread of teams and data has increased the data culture within companies. Interactions between decentralized actors are increasingly governed by contracts that the various teams establish between themselves. Data is no longer seen as an unreliable dark object to fall back on only when necessary. Each team works with data daily, and data is now a real product that must comply with quality standards on par with any other product the company generates. The quality of data is of extreme importance; it is no longer one problem of many, it is the problem.
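
As an illustration, here is a minimal, hypothetical sketch, not taken from the book, of what such a contract between a data producer and its consumers might look like once turned into an automated check. The column names, thresholds, and helper function are invented for the example.

    # A toy "data contract": the producer publishes the dataset only if it
    # meets the quality rules agreed with its consumers. All columns and
    # thresholds here are illustrative.
    import pandas as pd

    CONTRACT = {
        "required_columns": ["order_id", "customer_id", "amount"],
        "max_null_fraction": 0.01,  # at most 1% missing values per column
        "unique_key": "order_id",
    }

    def meets_contract(df: pd.DataFrame) -> bool:
        """Return True if the dataset satisfies the agreed quality rules."""
        if not set(CONTRACT["required_columns"]).issubset(df.columns):
            return False
        null_fractions = df[CONTRACT["required_columns"]].isna().mean()
        if (null_fractions > CONTRACT["max_null_fraction"]).any():
            return False
        return df[CONTRACT["unique_key"]].is_unique

    orders = pd.DataFrame(
        {"order_id": [1, 2, 3], "customer_id": [10, 11, 10], "amount": [9.5, 12.0, 3.2]}
    )
    print(meets_contract(orders))  # True: safe to publish to consumers

The point is not the specific rules but the shift in posture: quality expectations are written down and enforced at publication time, like acceptance tests for any other product.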

In this section, we learned about the data maturity path that many companies are following and understood the reasons that are pushing them to invest more and more in data quality.

In the next section, we will understand how to identify information bias in data, introduce the roles of data producers and data consumers, and cover the expectations and responsibilities of these two actors toward data quality.
