Distinguishing between Data Warehouses and Data Lakes
There are several definitions of Data Warehousing on the internet. The narrower ones characterize a warehouse as the database and the model used in the database; the wider descriptions look at the term as a method, and a suitable collection of all organizational and technological components that make up a BI solution. They talk about everything from the Extract Transform Load tool (ETL tool) to the database, the model, and, of course, the reporting and dashboarding solution.
Understanding Data Warehouse patterns
When we look at the Data Warehousing method in general, at its heart, we find a database that offers a certain table structure. We almost always find two main types of artifacts in the database: Facts and Dimensions.
Facts provide all the measurable information that we want to analyze; for example, the quantities of products sold per customer, per region, per sales representative, and per time. Facts are normally quite narrow objects, but with a lot of rows stored.
In the Dimensions, we will find all the descriptive information that can be linked to the Facts for analysis. Every piece of information that a user puts on their report or dashboard to aggregate and group the fact data, filter it, and view it is collected in the Dimensions. All the data related to customer information, such as Product, Contract, Address, and so on, that might need to be analyzed and correlated is stored here. Typically, these objects are stored as tables in the database and are joined using their key columns. Dimensions are normally wide objects, sometimes with controlled redundancy, that look at the given modeling method.
Three main methods for modeling the Facts and Dimensions within a Data Warehouse database have crystalized over the years of its evolution:
- Star-Join/Snowflake: This is probably the most famous method for Data Warehouse modeling. Fact tables are put in the center of the model, while Dimension tables are arranged around them, inheriting their Primary Key into the Fact table. In the Star-Join method, we find a lot of redundancy in the tables since as all the Dimension data, including all hierarchical information (such as Product Group -> Product SubCategory -> Product Category) regarding a certain artifact (Product, Customer, and so on), is stored in one table. In a Snowflake schema, hierarchies are spread in additional tables per hierarchy level and are linked over relationships with each other. This, when expressed in a graph, turns out to show a kind of snowflake pattern.
- Data Vault: This is a newer method that reflects the rising structural volatility of data sources, which offers higher flexibility and speed for developing. Entities that need to be analyzed are stored over Hubs, Satellites, and Links. Hubs simply reflect the presence of an entity by storing its ID and some audit information such as its data source, create times, and so on. Each hub can have one or more Satellite(s). These Satellites store all the descriptive information about the entity. If we need to change the system and add new information about an entity, we can add another Satellite to the model, reflecting just the new data. This will bring the benefit of non-destructive deployments to the productive system in the rollout. In the Data Vault, the Customer data will be stored in one Hub (the CustomerID and audit columns) and one or more Satellites (the rest of the customer information). The structure in the model is finally brought by Links. They are provided with all the Primary Keys of the Hubs and, again, some metadata. Additionally, Links can have Satellites of their own that describe the relationships of the Link content. Therefore, the connection between a Customer and the Products they bought will be reflected in a Link, where the Customer-Hub-Key and the Product-Hub-Key are stored together with audit columns. A Link Satellite can, for example, reflect some characteristics of the relationship, such as the amount bought, the date, or a discount. Finally, we can even add a Star-Join-View schema to abstract all the tables of the Data Vault and make it easier to understand for the users.
- 3rd Normal form: This is the typical database modeling technique that is also used in (and first created for) so-called Online Transactional Processing (OLTP) databases. Artifacts are broken up into their atomic information and are spread over several tables so that no redundancy is stored in either table. The Product information of a system might be split in separate tables for the product's name, color, size, price information, and many more. To derive all the information of a 3rd Normal Form model, a lot of joining is necessary.
Investigating ETL/ELT
But how does data finally land in the Data Warehouse database? The process and the related tools are named Extract, Transform, Load (ETL), but depending on the sequence we implement it in, it may be referred to as ELT. You'll find several possible ways to implement data loading into a Data Warehouse. This can be done by implementing specialized ETL tools such as Azure Data Factory in the cloud, SQL Server Integration services (SSIS), Informatica, Talend, or IBM Data Stage, for example.
The biggest advantage of these tools is the availability of wide catalogues of ready-to-use source and target connectors. They can connect directly to a source, query the needed data, and even transform it while being transported to the target. In the end, data is loaded into the Data Warehouse database. Other advantages include its graphical interfaces, where complex logic can be implemented on a "point-and-click" basis, which is very easy to understand and maintain.
There are other options as well. Data is often pushed by source applications and a direct connection for the data extraction process is not wanted at all. Many times, files are provided that are stored somewhere near the Data Warehouse database and then need to be imported. Maybe there is no ETL tool available. Since nearly every database nowadays provides loader tools, the import can be accomplished using those tools in a scripted environment. Once the data has made its way to the database tables, the transformational steps are done using Stored Procedures that then move the data through different DWH stages or layers to the final Core Data Warehouse.
Understanding Data Warehouse layers
Talking about the Data Warehouse layers, we nearly always find several steps that are processed before the data is provided for reporting or dashboarding. Typically, there are at least the following stages or layers:
- Landing or staging area: Data is imported into this layer in its rawest format. Nothing is changed on its way to this area and only audit information is added to track down all the loads.
- QS, transient, or cleansing area: This is where the work is done. You will only find a few projects where data is consistent. Values, sometimes even mandatory ones from the source, may be missing, content might be formatted incorrectly or even corrupted, and so on. In this zone, all the issues with the source data are taken care of, and the data is curated and cleansed.
- Core Data Warehouse: In this part of the database, the clean data is brought into the data model of choice. The model takes care of data historization, if required. This is also one possible point of security – or, even better, the central point of security. Data is secured so that all participating users can only see what they can see. The Core Data Warehouse is also the place where performance is boosted. By using all the available database techniques, such as indexes, partitioning, compression, and so on, the Data Warehouse will be tuned to suit the performance needs of the reporting users. The Core Data Warehouse is often also seen as the presentation layer of the system, where all kinds of reporting and dashboarding and self-service BI tools can connect and create their analysis.
Additionally, Data Warehouses have also brought in other types of layers. There might be sublayers for the cleansing area, for example, or the Data Marts, which are used to slice data semantically for the needs of certain user groups, or the Operational Data Store (ODS), which has different definitions, depending on who you ask. Sometimes, it is used as an intermediary data store that can be used for reporting. Other definitions speak of a transient zone that stores data for integration and further transport to the DWH. Sometimes, it is also used as a kind of archived data pool that the Data Warehouse can always be reproduced from. The definitions vary.
Implementing reporting and dashboarding
In terms of the reporting/dashboarding solution, we will also find several approaches to visualize the information that's provided by the Data Warehouse. There are the typical reporting tools, such as SQL Server Reporting Service or Crystal Reports, for example. These are page-oriented report generators that access the underlying database, get the data, and render it according to the template being used.
The more modern approach, however, has resulted in tools that can access a certain dataset, store it in an internal caching database, and allow interactive reporting and dashboarding based on that data. Self-Service BI tools such as Power BI, QLIK, and Tableau allow you to access the data, create your own visuals, and put them together in dashboards. Nowadays, these tools can even correlate data from different data sources and allow you to analyze data that might not have made it to the Data Warehouse database yet.
Loading bigger amounts of data
You can scale databases with newer, faster hardware, more memory, and faster disks. You can also go from Symmetric Multi-Processing (SMP) to Massively Parallel Processing (MPP) databases. However, the usual restrictions still apply: the more data we need to process, the longer it will take to do so. And there are also workloads that databases will not support, such as image processing.
Starting with Data Lakes
It's funny that we find a similar mix or maybe even call it confusion when we examine the term Data Lake. Many people refer to a Data Lake as a Hadoop Big Data implementation that delivers one or more clusters of computers, where a distributed filesystem and computational software is installed. It can deal with a distributed Input and Output (I/O) on the one hand but can also do distributed and parallel computation. Such a system adds specialized services for all kinds of workloads, be it just SQL queries against the stored data, in-memory computation, streaming analytics – you name it. Interestingly, as we mentioned while discussing Data Warehouses discussion, the narrower definition of the Data Lake only refers to the storage solution, and less to the method and the collection of services. To be honest, I like the wider definition far better, as it describes the method more holistically.
With the Hadoop Distributed File System (HDFS) from the Apache Software Foundation, a system is delivered that is even capable of storing data distributed over cheap legacy hardware clusters. HDFS splits the files into blocks and will also replicate those blocks within the system. This not only delivers a failsafe environment, but it also generates the biggest advantage of HDFS: parallel access to the data blocks for the consuming compute components. This means that a Spark cluster, for example, can read several data blocks in parallel, and with increased speed, because it can access several blocks of a file with several parallel threads.
With the possibility to hand files to a filesystem and start analyzing them just where they are, we enter another dogma.
MapReduce, the programming paradigm, supports parallel processing many files in this distributed environment. Every participating node runs calculations over a certain subset of all the files that must be analyzed. In the end, the results are aggregated by a driver node and returned to the querying instance. Different to the Data Warehouse, we can start analyzing data on a Schema-On-Read basis. This means we decide on the structure of the files that are processed, right when they are being processed. The Data Warehouse, in comparison, is based on a Schema-On-Write strategy, where the tables must be defined before the data is loaded into the database.
In the Data Lake world, you, as the developer, do not need to plan structures weeks ahead, as you would in the Data Warehouse. Imagine you have stored a vast quantity of files, where each row in a file will reflect an event, and each row also consists of, let's say, 100 columns. Using Schema-On-Read, you can just decide to only count the rows of all the files. You won't need to cut the rows into the columns for this process. For other purposes, you might need to split the rows during the reading process into their own columns to access the detailed information. You might need to predict a machine failure based on the content of the columns. Using a database, you can also mimic this behavior, but you would need to create different objects and selectively store data for each intention.
Understanding the Data Lake ecosystem
HDFS is the center of a Data Lake system. But we can't get further with just a filesystem, even one as sophisticated as this one. The Hadoop open source world around HDFS has therefore developed many services to interact and process the content that's kept in distributed storage. There are services such as Hive, a SQL-like Data Warehousing service; Pig, for highly parallel analysis jobs; Spark as a large-scale, in-memory analytical engine; Storm as a distributed real-time streaming analysis engine; Mahout for machine learning; Hbase as a database; Oozie, for workflows; and many more.
Spark clusters, with their ability to run distributed, in-memory analytical processes on distributed datasets, have influenced the market heavily in the recent years. Scala, one of the main programming languages used with Spark, is a Java-based programming language. It can be used to write high-performance routines for ETL/ELT, data cleansing, and transformation, but also for machine learning and artificial intelligence on massive datasets. Python also made it into this technology and is one of the go-to languages here. R, as the statistical programming language of choice for many data scientists, was also a must for the Spark offering. Dealing with data has been important for so many years and guess what – SQL is still a language that was not possible to skip in such an environment.
These services and languages are making the open source Hadoop ecosystem a rich, complex engine for processing and analyzing excessive amounts of data. They are available all over the clusters and can interact with the HDFS to make use of the distributed files and compute.
Tip
Jumping into a big data challenge should, just like a Data Warehouse project, always be a well-prepared and intensively examined project. Just starting it from an "I want that too!" position would be the worst driver you can have, and unfortunately happens far too often. When you start with purpose and find the right tool for that purpose, at least that selection might create the right environment for a successful project. Maybe we will find some alternatives for you throughout this book.
Comparing Data Lake zones
Very similar to the Data Warehouse layers, as we discussed previously, a Data Lake is also structured in different layers. These layers form the stages that the data must go through on its way to be formed into information. The three major zones are pretty much comparable. We can form a landing zone in the Data Lake, where data is written to "as-is" in the exact format as it comes from the source. The transient zone then compares to the cleansing area of the Data Warehouse. Data is transformed, wrangled, and massaged to suit the requirements of the analyzing users. As no one should access either the landing zone or the transient zone, the curated zone is where you're heading, and what you, as the developer, will allow your users to access. Other (similar) concepts talk about Bronze, Silver, and Gold areas, or use other terms. Just like in the Data Warehouse concept, the Data Lakes can include other additional zones. For example, zones for master data, user sandboxes, or logging areas for your routines. We can have several user- or group-related folder structures in the curated zone where data is aggregated, just like the Data Marts of the DWH.
Discovering caveats
So, yes, you can be flexible and agile. But in a Data Lake with Schema-On-Read, you'll need to decide which attributes you want to analyze or are needed by your machine learning models for training. This will, after all, force you to structure your sources and will therefore force you into a certain project life cycle. You will go through user stories, requirements analysis, structuring and development, versioning, and delivering artifacts.
If you're only analyzing tabular-oriented data, maybe it's worth checking if your hardware, your ETL-tool, and your company databases can scale to the needed volume.
This question, along with the nature of the source's data, complexity, and format, should be taken into account when you're deciding on the technology to use. If you need to process sound files, images, PDFs, or similar, then this is no job for a database. SQL does not offer language elements for this (you can add programming extensions to your database). But here, we have a definitive marker for the possible usage of technologies other than databases.
Once you have analyzed the so-called unstructured data, you will structure the results back into a tabular-oriented result set as an array (a tabular representation) of data. This needs to be delivered somehow to the recipients of your analysis. But how is that done in a Data Lake?
The typical reporting tools still require tables or table-like source data to work with. Often, they will import the source data into their internal database to be able to quickly answer reporting and dashboarding operations by their users. Experiences with these visualization tools have shown that it is not very performant to directly report from a vast number of files from a Data Lake. Data always needs to be condensed into digestible chunks for these purposes and should then also be stored in a suitable and accessible way.
Funnily enough, many of the services in the Hadoop ecosystem are equipped with similar functionality to the ones that databases have offered for ages now and are optimized for. Data Warehouse databases are more mature in many respects: when we look at Hive as the Data Warehouse service in Hadoop, for example, it still can't update the data and can't run queries with nested subqueries in Hive-QL. But over the years, it has been extended with all kinds of database-like functionality (views, indexes, and many more).
Important Note
The Data Lake approaches are still missing a fine-grained security model that can deliver centralized security mechanisms to control Row-Level-Security and Column-Level-Security over data while also allowing an easy-to-implement mechanism for Data Masking.