Learning about the maturity path of data in companies
The relationship between companies and data started a long time ago, at least from the end of the 1980s, with the first large-scale diffusion of computers in offices. As computers and data became more and more widespread in the following years, the usage of data in companies went through a very long period, of at least two decades, during which investments in data grew, but only linearly. We cannot speak of a data winter, but we can consider it a long wait for the spring that led to the explosion of data investments we have experienced since the second half of the 2000s. This long wait was brought to an end by at least three fundamental factors:
- The collapse in the cost of the resources necessary to store and process data (memory and CPUs)
- The advent of IoT devices, the widespread access to the internet, and the subsequent tsunami of available data
- The diffusion and accessibility of relatively simple yet advanced technologies dedicated to processing large amounts of data, such as Spark, Delta Lake, NoSQL databases, Hive, and Kafka
When these three fundamental pillars became accessible, the most attentive companies embarked on a complex path into the world of data, a maturity path that is still ongoing today, with several phases, each with its own challenges and problems:
Figure 1.1 – An example of the data maturity path
Each company started this path differently, but usually, the first problem to solve was managing the continuously growing volume of data coming from increasingly popular applications, such as e-commerce websites, social platforms, and the gaming industry, as well as apps for mobile devices. The solution was to invest in small teams of software engineers who experimented with the use and integration of big data technologies and platforms, most notably Hadoop with its main components – HDFS, MapReduce, and YARN – which are responsible, respectively, for storing enormous volumes of data, processing it, and managing resources, all in a distributed system. More recent technologies, such as Spark, Flink, Kafka, NoSQL databases, and Parquet, provided a further boost to this process. These software engineers were unaware that they were the first generation of a new role that is now one of the most popular and in-demand roles in software engineering – the data engineer.
These early teams were often seen as research and development teams, and the expectations placed on them grew along with the investments. The next step was to ask how these teams could express their full potential, which led companies to invest in an analytics team that could work alongside, or as a consumer of, the data engineering team. The natural way to start extracting value from data was to adopt advanced analytics and introduce techniques and solutions based on machine learning. Then, companies began to develop a corporate data culture and to appreciate the great potential and competitive advantage that data could provide. Whether they realized it or not, they were becoming data-driven companies, or at least data-informed ones; in the meantime, data began to be taken seriously – as a real asset and a critical component, not just a mysterious box from which to extract some insight only when strictly necessary.
The first results and the constant growth of the available data triggered a real race that pushed companies to invest more and more in personnel and data technologies. This led to the proliferation of new roles (data product manager, data architect, machine learning engineer, and so on) and an explosion in the number of data experts in the company, which in turn created new and unexplored organizational problems. The centralized data team model revealed all its limits, both in terms of scalability and in its lack of readiness to address the real problems of the business. Therefore, a process of decentralizing these data experts began; in addition to solving these issues, it introduced new challenges, such as the need to adopt data governance processes and methodologies. With this decentralization, with data becoming more and more central to the company, and with the need to strengthen data quality skills, what was merely important yesterday is becoming more and more of a priority today: governing and monitoring the quality of data.
The spread of teams and data across companies has increased the data culture within them. Interactions between decentralized actors are increasingly governed by contracts that the various teams establish with one another. Data is no longer seen as an unreliable, obscure object to fall back on only when necessary. Each team works with data daily, and data is now a real product that must comply with quality standards on par with any other product generated in the company. The quality of data is of extreme importance; it is no longer one problem among many, it is the problem.
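To make the idea of a team-to-team data contract and its quality standards a little more concrete, here is a minimal, hypothetical sketch in plain Python. It is not taken from any specific tool or from this book's examples; the field names, the expectations, and the check_batch function are assumptions made purely for illustration. The idea is that a consuming team declares what it expects from each record, and every batch coming from the producing team is checked against those expectations before it is used downstream.

```python
# Illustrative sketch of a "data contract" check between a producer and a consumer.
# All names (EXPECTATIONS, check_batch, the sample fields) are hypothetical.

from datetime import datetime


def _is_iso_date(value: str) -> bool:
    """Return True if the value parses as an ISO-8601 timestamp."""
    try:
        datetime.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


# Expectations the consuming team agreed on with the producing team
EXPECTATIONS = {
    "order_id": lambda v: isinstance(v, int) and v > 0,
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
    "created_at": lambda v: isinstance(v, str) and _is_iso_date(v),
}


def check_batch(records):
    """Validate every record against the agreed expectations.

    Returns a list of (record_index, field, value) tuples, one per violation,
    so the producing team can be alerted before the data reaches consumers.
    """
    violations = []
    for i, record in enumerate(records):
        for field, is_valid in EXPECTATIONS.items():
            value = record.get(field)
            if not is_valid(value):
                violations.append((i, field, value))
    return violations


if __name__ == "__main__":
    sample = [
        {"order_id": 1, "amount": 19.99, "created_at": "2024-05-01T10:00:00"},
        {"order_id": -3, "amount": "free", "created_at": "yesterday"},  # breaks the contract
    ]
    for index, field, value in check_batch(sample):
        print(f"record {index}: field '{field}' failed the contract (value={value!r})")
```

In practice, checks like this are usually delegated to dedicated data quality frameworks and run automatically on every pipeline execution, but the principle is the same: the contract between teams is explicit and enforced, rather than implicit and hoped for.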
In this section, we learned about the data maturity path that many companies are following and understood the reasons pushing them to invest more and more in data quality.
In the next section, we will learn how to identify information bias in data, introduce the roles of data producers and data consumers, and cover the expectations and responsibilities of these two actors toward data quality.