You're reading from Fundamentals of Analytics Engineering An introduction to building end-to-end analytics solutions

Product type Paperback

Published in Mar 2024

Publisher Packt

ISBN-13 9781837636457

Length 332 pages

Edition 1st Edition

Authors (7):

Dumky De Wilde

Ricardo Angel Granados Lopez

Lasse Benninga

Taís Laurindo Pereira

Jovan Gligorevic

Juan Manuel Perafan

Fanny Kassapian

Understanding a Modern Data Stack

As the name suggests, the MDS represents a technological evolution compared to previous systems widely used in recent decades. From the development of the business data warehouse in the 1980s to the rise of cloud technology with Amazon Web Services (AWS) in the early 2000s, on-premises legacy data stacks dominated the landscape. These systems had a monolithic IT infrastructure, resulting in complex maintenance. The MDS transformed this scenario – bringing modularity and cloud-native tools. However, before we dive into the details, let’s first define what a data stack is.

A data stack is a collection of tools and services as part of an extensive technology infrastructure designed to ingest, store, transform, and serve data. It makes data accessible across an organization and is fundamental to delivering business insights through reporting and dashboards, advanced analytics, and Machine Learning (ML) applications. Figure 2.1 illustrates an example of a high-level architecture of a data stack.

Figure 2.1 – An example of a high-level architecture of a data stack

Here, the data flows from left to right. The raw data is ingested, stored in a data warehouse, transformed, and finally, served to data analysts, data scientists, and business users.
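The left-to-right flow can be sketched in a few lines of Python. This is a minimal illustration rather than a real stack: a plain dictionary stands in for the data warehouse, and all function names, table names, and sample records are hypothetical.

```python
# A toy data stack: ingest -> store -> transform -> serve.
# Everything here (names, records, the dict-based "warehouse") is hypothetical.

def ingest():
    """Pull raw records from a source system (hard-coded here)."""
    return [
        {"order_id": 1, "customer": "acme", "amount": "120.50"},
        {"order_id": 2, "customer": "acme", "amount": "79.50"},
        {"order_id": 3, "customer": "globex", "amount": "40.00"},
    ]

def store(warehouse, table, rows):
    """Land the rows, unchanged, in the warehouse."""
    warehouse[table] = list(rows)

def transform(warehouse):
    """Parse the raw amounts and aggregate revenue per customer."""
    revenue = {}
    for row in warehouse["raw_orders"]:
        customer = row["customer"]
        revenue[customer] = revenue.get(customer, 0.0) + float(row["amount"])
    warehouse["revenue_by_customer"] = revenue

def serve(warehouse):
    """Expose the modeled table to analysts and business users."""
    return warehouse["revenue_by_customer"]

warehouse = {}
store(warehouse, "raw_orders", ingest())
transform(warehouse)
report = serve(warehouse)
print(report)
```

Each function corresponds to one stage in Figure 2.1. In a real stack, each stage would typically be handled by a separate tool rather than a function in a single script.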

The MDS is therefore a specific instance of this architecture – a particular set of tools that democratizes access to the main functionalities of a data stack, reducing the complexity of implementation and improving the scalability of the data life cycle.

In the following table, we compare the main characteristics of the legacy data stack and the MDS.

Characteristic	Legacy Data Stack	Modern Data Stack
Architecture	Monolithic architecture	Modular tools
Servers	On-premises servers	Cloud-based
Maintenance	Complex – many resources required	Simplified, managed solutions
Programming languages	Java/Scala/Python	SQL-first
Data ingestion	ETL-focused	ELT-focused

Table 2.1 – A legacy versus Modern Data Stack comparison
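The difference between ETL and ELT in the last row of the table may be easier to see in code. The following sketch uses Python's built-in sqlite3 module as a stand-in for a cloud data warehouse: the raw data is loaded first and only then transformed with SQL inside the warehouse (ELT). In an ETL flow, the cleaning step would instead run in application code before loading. The table names, columns, and sample rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical raw extract from a source system: amounts arrive as text.
raw_rows = [
    (1, "acme", " 120.50 "),
    (2, "acme", "79.50"),
    (3, "globex", "40.00"),
]

conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse

# E + L: land the data as-is, with no cleaning in application code.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# T: transform inside the warehouse, using SQL only.
conn.execute(
    """
    CREATE TABLE revenue_by_customer AS
    SELECT customer, SUM(CAST(TRIM(amount) AS REAL)) AS revenue
    FROM raw_orders
    GROUP BY customer
    """
)

result = conn.execute(
    "SELECT customer, revenue FROM revenue_by_customer ORDER BY customer"
).fetchall()
print(result)
```

Because the raw table is kept untouched, the transformation can be changed and rerun later without extracting the data again, which is one reason the SQL-first, ELT-focused approach suits analytics engineers.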

Now that we have seen the definition of the MDS and how it compares to legacy stacks, it is time to see how it looks in practice. Figure 2.2 provides an example, with an overview of some of the tools used for key functionalities within this design:

Figure 2.2 – An example of an MDS

As we can see in Figure 2.2, the main blocks of an MDS can be considered as follows:

Managed ingestion: This is responsible for the EL in ELT. It helps to streamline data extraction and ingestion through third-party-managed software applications.
Data warehouse/lakehouse: These are cloud-based systems used to store large volumes of data.
Transformation: This is what the T in ELT stands for. Transformations help in data cleaning and preparation to meet business intelligence needs.
Orchestration: Orchestration sets up specific tasks to run automatically on a schedule or in response to a particular event. For example, an orchestration tool can be paired with dbt Core to run transformations on a schedule.
Self-service layer: In this block, reports and analysis are provided to business users, enabling these stakeholders to answer their questions and make data-informed decisions.
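To make the role of orchestration more concrete, the sketch below runs one task per block in dependency order, using Python's standard-library graphlib module. It is only a toy: real orchestration tools add scheduling, retries, and monitoring, and the task names here are hypothetical placeholders rather than the API of any particular tool.

```python
from graphlib import TopologicalSorter  # available in Python 3.9+

run_log = []

def make_task(name):
    """Return a placeholder task that only records that it ran."""
    def task():
        run_log.append(name)
    return task

# Hypothetical tasks, one per block of the stack.
task_names = ("ingest", "load", "transform", "refresh_dashboard")
tasks = {name: make_task(name) for name in task_names}

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "load": {"ingest"},
    "transform": {"load"},
    "refresh_dashboard": {"transform"},
}

# static_order() yields task names so that dependencies always come first.
for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()

print(run_log)
```

Declaring dependencies rather than hard-coding a sequence means that a new task only needs to state what it depends on for the run order to be derived.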

With managed ingestion, data teams are less dependent on the work of data engineers. Tools such as Stitch, Fivetran, and Airbyte empower analytics engineers to own end-to-end data pipelines, focusing on data modeling and transformation.

The list of tools and blocks in Figure 2.2 is not exhaustive. New tools and MDS companies emerge every year. Other common and important blocks are as follows:

Data catalog tools are used for search and discovery, helping users to document and democratize access to business logic and data assets. Examples are Atlan, data.world, and DataHub.
Data quality and observability tools are used to monitor pipelines and ensure the overall quality of data. Examples are Soda, Datafold, and Monte Carlo.
Reverse ETL is used to retrieve data from a data warehouse and publish it to the systems used by business users, such as CRM software. Key market leaders are Census, Hightouch, and RudderStack.
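As a rough illustration of reverse ETL, the following sketch reads a modeled table from a warehouse and pushes each row to an operational system. The sqlite3 database stands in for the warehouse, and the FakeCrm class is a hypothetical placeholder for a CRM client. It does not represent the API of Census, Hightouch, RudderStack, or any real CRM.

```python
import sqlite3

class FakeCrm:
    """Hypothetical in-memory stand-in for a CRM's API client."""

    def __init__(self):
        self.accounts = {}

    def upsert_account(self, account_id, fields):
        # Create the account if needed, then merge in the new fields.
        self.accounts.setdefault(account_id, {}).update(fields)

# Stand-in warehouse holding an already-modeled table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_metrics (customer TEXT, lifetime_value REAL)")
conn.executemany(
    "INSERT INTO customer_metrics VALUES (?, ?)",
    [("acme", 200.0), ("globex", 40.0)],
)

def sync_to_crm(conn, crm):
    """Read from the warehouse and publish each row to the CRM."""
    rows = conn.execute("SELECT customer, lifetime_value FROM customer_metrics")
    synced = 0
    for customer, lifetime_value in rows:
        crm.upsert_account(customer, {"lifetime_value": lifetime_value})
        synced += 1
    return synced

crm = FakeCrm()
synced_count = sync_to_crm(conn, crm)
print(synced_count, crm.accounts)
```

The direction of flow is the point here: the warehouse is the source and the business application is the destination, which is the opposite of the ingestion step.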

We will now move our focus to how the modern data stack differs from the legacy stacks.
