Understanding a Modern Data Stack
As the name suggests, the MDS represents a technological evolution compared to the systems widely used in previous decades. From the development of the business data warehouse in the 1980s until the rise of cloud technology with Amazon Web Services (AWS) in the mid-2000s, on-premises legacy data stacks dominated the landscape. These systems had a monolithic IT infrastructure, which made them complex to maintain. The MDS transformed this scenario by bringing modularity and cloud-native tools. However, before we dive into the details, let’s first define what a data stack is.
A data stack is a collection of tools and services, part of a broader technology infrastructure, designed to ingest, store, transform, and serve data. It makes data accessible across an organization and is fundamental to delivering business insights through reporting and dashboards, advanced analytics, and Machine Learning (ML) applications. Figure 2.1 illustrates an example of a high-level architecture of a data stack.
Figure 2.1 – An example of a high-level architecture of a data stack
Here, the data flows from left to right. The raw data is ingested, stored in a data warehouse, transformed, and finally, served to data analysts, data scientists, and business users.
The MDS, then, is a subset of such an architecture: a specific set of tools that democratizes access to the main functionalities of a data stack, reducing the complexity of implementation and improving the scalability of the data life cycle.
In the following table, we compare the main characteristics of the legacy data stack and the MDS.
| Characteristic | Legacy Data Stack | Modern Data Stack |
| --- | --- | --- |
| Architecture | Monolithic architecture | Modular tools |
| Servers | On-premises servers | Cloud-based |
| Maintenance | Complex – many resources required | Simplified, managed solutions |
| Programming languages | Java/Scala/Python | SQL-first |
| Data ingestion | ETL-focused | ELT-focused |
Table 2.1 – A legacy versus Modern Data Stack comparison
Now that we have seen the definition of the MDS and how it compares to legacy stacks, it is time to see how it looks in practice. In Figure 2.2, an example is provided, with an overview of some of the tools that are used for key functionalities within this design:
Figure 2.2 – An example of an MDS
As we can see in Figure 2.2, the main blocks of an MDS can be considered as follows:
- Managed ingestion: This is responsible for the EL in ELT. It helps to streamline data extraction and ingestion through third-party-managed software applications.
- Data warehouse/lakehouse: These are cloud-based systems used to store large volumes of data.
- Transformation: This is what the T in ELT stands for. Transformations help in data cleaning and preparation to meet business intelligence needs.
- Orchestration: Orchestration helps in setting up specific tasks to run automatically, either on a schedule or in response to a particular event. As an example, an orchestration tool can be paired with dbt Core for scheduled transformations.
- Self-service layer: In this block, reports and analysis are provided to business users, enabling these stakeholders to answer their questions and make data-informed decisions.
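To make the blocks above more concrete, the following is a minimal sketch of the ELT flow, not a production setup. It assumes an in-memory SQLite database standing in for the cloud data warehouse, a hard-coded list standing in for the output of a managed ingestion tool, and a plain Python function standing in for an orchestrator. All table, column, and function names (`raw_orders`, `customer_revenue`, `run_pipeline`, and so on) are hypothetical.

```python
import sqlite3

# Hypothetical raw records, standing in for what a managed ingestion tool extracts.
RAW_ORDERS = [
    ("1001", " alice ", "20.50"),
    ("1002", "BOB", "5.00"),
    ("1003", "alice", "10.00"),
]


def load_raw(conn, rows):
    """EL step: land the data unchanged in a raw table."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform(conn):
    """T step: clean and aggregate inside the warehouse, using SQL."""
    conn.execute(
        """
        CREATE TABLE customer_revenue AS
        SELECT LOWER(TRIM(customer)) AS customer,
               SUM(CAST(amount AS REAL)) AS revenue
        FROM raw_orders
        GROUP BY LOWER(TRIM(customer))
        """
    )


def serve(conn):
    """Self-service step: query the modeled table."""
    return conn.execute(
        "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
    ).fetchall()


def run_pipeline(rows):
    """Stand-in for orchestration: run the steps in a fixed order."""
    conn = sqlite3.connect(":memory:")
    try:
        load_raw(conn, rows)
        transform(conn)
        return serve(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_pipeline(RAW_ORDERS))
```

Note that the raw table keeps the messy values as they arrived; the cleaning happens afterward, in SQL, inside the warehouse. That ordering is what distinguishes ELT from ETL, where the data would be cleaned before loading.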
With managed ingestion, data teams are less dependent on the work of data engineers. Tools such as Stitch, Fivetran, and Airbyte empower analytics engineers to own end-to-end data pipelines, focusing on data modeling and transformation.
The list of tools and blocks in Figure 2.2 is not exhaustive. New tools and MDS companies emerge every year. Other common and important blocks are as follows:
- Data catalog tools are used for search and discovery, helping users to document and democratize access to business logic and data assets. Examples are Atlan, data.world, and DataHub.
- Data quality and observability tools are used to monitor pipelines and ensure the overall quality of data. Examples are Soda, Datafold, and Monte Carlo.
- Reverse ETL is used to retrieve data from a data warehouse and publish it to the systems used by business users, such as CRM software. Key market leaders are Census, Hightouch, and RudderStack.
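As a rough illustration of two of these blocks, the sketch below runs simple data quality checks and then shapes warehouse rows for a downstream system, which is the core idea of reverse ETL. It does not use the API of any tool named above: the in-memory SQLite database, the `customers` table, and the function names are all assumptions made for the example, and a real reverse ETL tool would also handle the delivery of records to the target system.

```python
import sqlite3


def build_warehouse():
    """Create a hypothetical warehouse table with two deliberate data issues."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers "
        "(customer_id INTEGER, email TEXT, lifetime_value REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            (1, "a@example.com", 120.0),
            (2, None, 40.0),             # missing email
            (2, "b@example.com", 75.5),  # duplicated customer_id
        ],
    )
    return conn


def run_quality_checks(conn):
    """Return the number of violations per check (0 means the check passes)."""
    checks = {
        "email_not_null": "SELECT COUNT(*) FROM customers WHERE email IS NULL",
        "customer_id_unique": """
            SELECT COUNT(*) FROM (
                SELECT customer_id FROM customers
                GROUP BY customer_id HAVING COUNT(*) > 1
            )
        """,
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}


def build_crm_payloads(conn):
    """Reverse ETL sketch: shape warehouse rows into records for a CRM-like system."""
    rows = conn.execute(
        "SELECT email, lifetime_value FROM customers "
        "WHERE email IS NOT NULL ORDER BY email"
    ).fetchall()
    return [{"email": email, "lifetime_value": ltv} for email, ltv in rows]


if __name__ == "__main__":
    warehouse = build_warehouse()
    print(run_quality_checks(warehouse))
    print(build_crm_payloads(warehouse))
    warehouse.close()
```

Expressing each check as a query that counts violations keeps the checks in SQL, in line with the SQL-first character of the MDS.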
We will now move our focus to how the modern data stack differs from the legacy stacks.