The big data platform
As the internet took off in the 1990s and the size and importance of data grew with it, the big tech companies started developing a new generation of data tooling and architectures that aimed to reduce the cost of storing and transforming vast quantities of data. In 2003, Google wrote a paper describing their Google File System, and in 2004 followed that up with another paper, titled MapReduce: Simplified Data Processing on Large Clusters. These ideas were then implemented at Yahoo! and open sourced as Apache Hadoop in 2006.
Apache Hadoop contained two core modules. The Hadoop Distributed File System (HDFS) gave us the ability to store almost limitless amounts of data reliably and efficiently on commodity hardware. The MapReduce engine then gave us a model with which we could implement programs to process and transform this data, at scale, also on commodity hardware.
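To make the MapReduce model more concrete, here is a minimal single-process sketch of the classic word count example, written in plain Python. It is not Hadoop code and does not use any Hadoop API. The function names (map_phase, shuffle, reduce_phase, and word_count) are hypothetical and chosen only for illustration. In a real cluster, the framework would run many map and reduce tasks in parallel across machines and perform the grouping step for you.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key.

    A MapReduce framework does this between the two phases; here it is
    simulated in memory (an assumption made for this sketch).
    """
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big lake"]))
```

The appeal of the model is that map_phase and reduce_phase each look at only one line or one key at a time, so the work can be split across many machines without the functions needing to know about each other.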
This led to the popularization of big data, which was the collective term for our reporting, ML, and analytics capabilities with HDFS and MapReduce as the foundation. These platforms used open source technology and could be on-premises or in the cloud. The reduced costs made this accessible to organizations of any size, who could either implement it themselves or use a packaged enterprise solution provided by the likes of Cloudera and MapR.
The following diagram shows the reference data platform architecture built upon Hadoop:
Figure 1.2 – The big data platform architecture
At the center of the architecture is the data lake, implemented on top of HDFS or a similar filesystem. Here, we could store an almost unlimited amount of semi-structured or unstructured data. This data still needed to be loaded into an EDW in order to drive analytics, as data visualization tools such as Tableau needed a SQL-compatible database to connect to.
Because there were no expectations set on the structure of the data in the data lake, and no limits on the amount of data, it was very easy to write as much as you could and worry about how to use it later. This led to the concept of extract, load, and transform (ELT), as opposed to ETL. The idea was to extract and load the data into the data lake first without any processing, then apply schemas and transforms later, either when loading the data into the data warehouse or when reading it in other downstream processes.
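The difference between loading first and applying a schema later can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the RAW_EVENTS records, the field names, and the load_raw and read_orders functions are hypothetical, and an in-memory list stands in for the data lake.

```python
from __future__ import annotations

import json
from typing import Iterable, Iterator

# Hypothetical raw records, as they might arrive from a source system.
RAW_EVENTS = [
    '{"order_id": "1", "amount": "19.99", "currency": "GBP"}',
    '{"order_id": "2", "amount": "5"}',
    '{"id": "3", "total": "7.50"}',  # upstream renamed its fields
    "not json",                       # corrupt record
]


def load_raw(records: Iterable[str], lake: list[str]) -> None:
    """The 'load' step of ELT: store every record untouched, with no validation."""
    lake.extend(records)


def read_orders(lake: Iterable[str]) -> Iterator[dict]:
    """The 'transform' step: apply a schema only when the data is read.

    Records that do not fit the expected shape are silently skipped,
    which is one way data in the lake ends up unused.
    """
    for line in lake:
        try:
            record = json.loads(line)
            yield {
                "order_id": int(record["order_id"]),
                "amount": float(record["amount"]),
                "currency": record.get("currency", "unknown"),
            }
        except (ValueError, KeyError, TypeError):
            continue


if __name__ == "__main__":
    lake: list[str] = []
    load_raw(RAW_EVENTS, lake)
    print(len(lake), "records stored;", len(list(read_orders(lake))), "usable as orders")
```

Note that the load step never fails, however malformed the input is. Problems only surface later, at read time, and they land on whoever is trying to consume the data.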
We then had much more data than ever before. With a low barrier to entry and cheap storage, data was easily added to the data lake, whether there was a consumer requirement in mind or not.
However, in practice, much of that data was never used. For a start, it was almost impossible to know what data was in there and how it was structured. It lacked any documentation, had no set expectations on its reliability and quality, and had no governance over how it was managed. Then, once you did find some data you wanted to use, you needed to write data processing jobs using Hadoop MapReduce or, later, Apache Spark. But this was very difficult to do – particularly at any scale – and only achievable by a large team of specialist data engineers. Even then, those jobs tended to be unreliable and have unpredictable performance.
This is why we started hearing people refer to it as the data swamp. While much of the data was likely valuable, the inaccessibility of the data lake meant it was never used. Gartner introduced the term dark data to describe this, where data is collected and never used, and the costs of storing and managing that data outweigh any value gained from it (https://www.gartner.com/en/information-technology/glossary/dark-data). In 2015, IDC estimated 90% of unstructured data could be considered dark (https://www.kdnuggets.com/2015/11/importance-dark-data-big-data-world.html).
Another consequence of this architecture was that it moved the end data consumers further away from the data generators. Typically, a central data engineering team was introduced to focus solely on ingesting the data into the data lake, building the tools and the connections required to do that from as many source systems as possible. They were the ones interacting with the data generators, not the ultimate consumers of the data.
So, despite the advance in tools and technologies, in practice, we still had many of the same limitations as before. Only a limited amount of data could be made available for analysis and other uses, and we had that same bottleneck controlling what that data was.
Note
Let’s return to our example to illustrate how different roles worked together with this architecture.
Our data generator, Vivianne, is a software engineer working on a service that writes its data to a database. She may or may not be aware that some of the data from that database is extracted in a raw form, and is unlikely to know exactly which data that is. Certainly, she doesn’t know why.
Ben is a data engineer who works on the ELT pipeline. He aims to extract as much of the data as possible into the data lake. He doesn’t know much about the data itself, or what it will be used for. He spends a lot of time dealing with changing schemas that break his pipelines.
Leah is another data engineer, specializing in writing MapReduce jobs. She takes requirements from data analysts and builds datasets to meet those requirements. She struggles to find the data she wants and needs to learn a lot about the upstream services and their data models in order to produce what she hopes is the right data. These MapReduce jobs have unpredictable performance, are difficult to debug, and do not run reliably.
The BI analyst, Bukayo, takes this data and creates reports to support the business. These reports often break due to an issue upstream. There are no expectations defined at any of these steps, and therefore no guarantees on the reliability or correctness of the data can be provided to those consuming Bukayo’s data.
The data generator, Vivianne, is far away from the data consumer, Bukayo, and there is no communication. Vivianne has no understanding of how the changes she makes affect key business processes.
While Bukayo and his peers can usually get the data they need prioritized by Leah and Ben, those who are not BI analysts and want data for other needs lack the autonomy and the expertise to access it, preventing the use of data for anything other than the most critical business requirements.
The next generation of data architectures began in 2012 with the launch of Amazon Redshift on AWS and the explosion of tools and investment into what became known as the modern data stack (MDS). In the next section, we’ll explore this architecture and see whether we can finally get rid of this bottleneck.