If, at this point, you are a bit confused by so many terms – Databricks, lakehouse, Databricks SQL, and more – worry not. We are just at the beginning of our learning journey. We will unpack all of these terms throughout this book.
First, what is Databricks?
Databricks is a platform that enables enterprises to quickly build their data lakehouse infrastructure and enables all data personas in their organization – data engineers, data scientists, and business intelligence personnel – to extract and deliver insights from the data. The platform provides a curated experience for each data persona so that they can execute their daily workflows. The foundational technologies that enable these experiences are open source – Apache Spark, Delta Lake, MLflow, and more.
So, what is the Lakehouse architecture and why do we need it?
The Lakehouse architecture was formally presented at the Conference on Innovative Data Systems Research (CIDR) in January 2021. You can download the paper from https://databricks.com/research/lakehouse-a-new-generation-of-open-platforms-that-unify-data-warehousing-and-advanced-analytics. It is an easily digestible paper, and I encourage you to read it for the full details. That said, I will summarize its salient points here.
Attribution, Where it is Due
The figures in my summary are recreations of the images originally provided in the research paper. As such, they remain the intellectual property of the paper's authors.
According to the paper, most of the present-day data analytics infrastructures look like a two-tier system, as shown in the following diagram:
Figure 1.1 – Two-tier data analytics infrastructures
In this two-tier system, data from source systems is first ingested into a data lake. Source systems could include your web or mobile applications, transactional databases, ERP systems, social media feeds, and more. The data lake is typically an on-premises HDFS system or cloud object storage. Data lakes allow you to store data in big data-optimized file formats such as Apache Parquet, ORC, and Avro. These open file formats make writing to the data lake flexible (thanks to schema-on-read semantics): data can be ingested quickly, which, in turn, gives end users faster access to it and enables more advanced analytics use cases in ML and AI.
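To make schema-on-read concrete, here is a minimal PySpark sketch (the bucket paths are hypothetical, and I am assuming a Spark session with access to cloud object storage). Raw data lands in the lake as Parquet with no schema declared up front; the schema is only resolved when a consumer reads the files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Land raw events in the lake as Parquet. No table schema is
# declared up front -- the writer persists whatever columns the
# source happened to produce.
events = spark.read.json("s3://my-bucket/raw/clickstream/")  # hypothetical path
events.write.mode("append").parquet("s3://my-bucket/lake/clickstream/")

# Schema-on-read: the schema is resolved from the Parquet files
# only when a consumer reads them, so ingestion never blocks on
# upfront schema design.
df = spark.read.parquet("s3://my-bucket/lake/clickstream/")
df.printSchema()
```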
Of course, this architecture still needs to support traditional BI workloads and decision support systems. Hence, a second process, typically in the form of Extract, Transform, and Load (ETL), is built to copy data from the data lake into a dedicated data warehouse.
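As a sketch of that second tier, the following hypothetical PySpark job copies lake data into a relational warehouse over JDBC (the connection details and table names are made up for illustration, and a suitable JDBC driver is assumed to be on the classpath). Notice that the same data now lives in two systems:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Second tier: a scheduled ETL job copies data from the lake into
# a dedicated warehouse, duplicating it in a second system.
curated = spark.read.parquet("s3://my-bucket/lake/clickstream/")
(curated.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://warehouse-host:5432/analytics")  # hypothetical warehouse
    .option("dbtable", "analytics.clickstream")
    .option("user", "etl_user")
    .option("password", "...")  # placeholder credential
    .mode("overwrite")
    .save())
```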
Close inspection of the two-tier architecture reveals several systemic problems:
- Duplication of data: This architecture requires the same data to be present in two different systems, which increases storage costs. The two copies must also be constantly reconciled, which adds ETL operations and their associated costs.
- Security and governance: Data lakes and data warehouses take very different approaches to securing data. The result is two different security mechanisms for the same data, which must be kept in sync at all times to avoid data security violations.
- Latency in data availability: In the two-tier architecture, data only reaches the warehouse via a secondary process, which introduces latency. Analysts never get access to the freshest data, which also makes the architecture unsuitable for tactical, operational decision support.
- Total cost of ownership: Enterprises end up paying twice for the same data – two storage systems, two ETL processes, two sets of engineering debt, and more.
As you can see, this is unintuitive and unsustainable.
Hence, the paper presents the Lakehouse architecture as the way forward.
Simply put, the Lakehouse architecture describes a data management system that implements all the features of data warehouses directly on data lakes. This makes the data lakehouse a single, unified platform for business intelligence and advanced analytics.
This means that a lakehouse platform implements data management features such as security controls, ACID transaction guarantees, data versioning, and auditing. It also implements query performance features such as indexing, caching, and query optimization. These features are table stakes for data warehouses; the Lakehouse architecture brings them to the flexible, open-format data storage of data lakes. In short, a Lakehouse provides both data warehousing and advanced analytics capabilities on a single platform, with the economics of cloud data lakes.
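To illustrate, here is a minimal sketch of two of those warehouse-grade features – ACID-transactional writes and data versioning – applied directly to data lake storage, using Delta Lake (one of the open source technologies mentioned earlier). It assumes a Delta-enabled Spark session, as on a Databricks cluster, and the table path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "s3://my-bucket/lake/orders_delta/"  # hypothetical table location

# ACID write: the append either commits fully to the Delta
# transaction log or is not visible at all -- concurrent readers
# never see half-written files.
new_orders = spark.createDataFrame(
    [(1001, "shipped"), (1002, "pending")], ["order_id", "status"]
)
new_orders.write.format("delta").mode("append").save(path)

# Data versioning ("time travel"): every commit is recorded in the
# log, so earlier snapshots of the table remain queryable.
latest = spark.read.format("delta").load(path)
first_commit = spark.read.format("delta").option("versionAsOf", 0).load(path)
```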
What is the Formal Definition of the Lakehouse?
Section 3 of the CIDR paper formally defines the Lakehouse. Check it out.
The following is a visual depiction of the Lakehouse:
Figure 1.2 – Lakehouse architecture
The idea of the Lakehouse is deceptively simple – as all good things in life are! The Lakehouse architecture immediately solves the problems we highlighted in present-day two-tier architectures:
- A single storage layer means no duplication of data and no extra effort spent reconciling it. Reduced ETL requirements and ACID guarantees translate into a more stable and reliable system.
- A single storage layer means a single model of security and governance for all data assets. This reduces the risk of security breaches.
- A single storage layer means the availability of the freshest data possible for the consumers of the data.
- Cheap cloud storage with elastic, on-demand cloud compute reduces the total cost of ownership.
- Open source technologies in the storage layer reduce the chances of vendor lock-in and make it easy to integrate with other tools.
Of course, any implementation of the Lakehouse will have to ensure the following:
- Reliable data management: The Lakehouse proposes to eliminate (or at least reduce the role of) data warehouses. Hence, a Lakehouse implementation must efficiently deliver data management and governance – features that are table stakes in data warehouses.
- SQL performance: The Lakehouse will have to provide state-of-the-art SQL performance on top of the open-access filesystems and file formats typical in data lakes.
This is where the Databricks Lakehouse platform, and within it, the Databricks SQL product, comes in.