The Nature of Data Warehouses and Data Lakes
A data warehouse (DW or DWH) is a central repository of current and historical data that has been integrated from one or more disparate sources. Also referred to as an enterprise data warehouse (EDW), the DWH is used for data analysis and reporting, and it is usually considered the core of an enterprise business intelligence strategy.
Data stored in a DWH comes from multiple systems, including operational systems (such as CRM systems). To ensure data quality, the data may need to undergo a set of data cleansing activities before it can be loaded into the DWH.
Some DWH tools have built-in extract, transform, and load (ETL) capabilities, while others rely on external third-party tools (you will cover ETL tools and other integration middleware in Chapter 3, Core Architectural Concepts: Integration and Cryptography). The ETL capability helps ensure that the ingested data meets a defined level of quality and conforms to a defined structure. Data might be held in a dedicated staging area before it is loaded into the DWH. You need to become familiar with the names of some popular data warehousing solutions, such as Amazon Redshift, which is a fully managed, cloud-based data warehouse.
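The cleanse, stage, and load flow described above can be illustrated with a minimal sketch. It is not how any particular DWH or ETL product works: an in-memory SQLite database stands in for the warehouse, and the record layout, the cleansing rules, and the table names (stg_account, dw_account) are all hypothetical choices made for illustration.

```python
import sqlite3

# Hypothetical records as they might arrive from an operational system such as a CRM.
raw_accounts = [
    {"id": "001", "name": "  Acme Corp ", "annual_revenue": "1200000"},
    {"id": "002", "name": "Globex", "annual_revenue": ""},            # missing revenue
    {"id": "001", "name": "Acme Corp", "annual_revenue": "1200000"},  # duplicate id
    {"id": "", "name": "No Id Ltd", "annual_revenue": "50"},          # no usable key
]


def cleanse(records):
    """Transform step: trim text, convert revenue to a number,
    drop rows without an id, and keep only the first row per id."""
    seen = set()
    cleaned = []
    for rec in records:
        rec_id = rec["id"].strip()
        if not rec_id or rec_id in seen:
            continue
        seen.add(rec_id)
        revenue = rec["annual_revenue"].strip()
        cleaned.append((rec_id, rec["name"].strip(), float(revenue) if revenue else None))
    return cleaned


def load(conn, rows):
    """Load step: write the cleansed rows to a staging table,
    then copy them into the (assumed) warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS stg_account (id TEXT, name TEXT, annual_revenue REAL)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dw_account (id TEXT PRIMARY KEY, name TEXT, annual_revenue REAL)"
    )
    conn.execute("DELETE FROM stg_account")  # start each run with an empty staging area
    conn.executemany("INSERT INTO stg_account VALUES (?, ?, ?)", rows)
    conn.execute(
        "INSERT OR REPLACE INTO dw_account SELECT id, name, annual_revenue FROM stg_account"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection
load(conn, cleanse(raw_accounts))
loaded = conn.execute("SELECT id, name, annual_revenue FROM dw_account ORDER BY id").fetchall()
print(loaded)
```

Of the four incoming records, only two reach the warehouse table: the duplicate and the record without an identifier are rejected during cleansing, which is the kind of quality and structure guarantee an ETL step is meant to provide.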
A data lake is a system or repository that stores data in its natural, raw format. Data is typically held in its rawest form, without being processed, modified, or analyzed first. Data lakes accept and store all kinds of data from all sources, so structured data (such as data coming from databases), semi-structured data (such as XML data), and data that has already been processed or transformed can be stored alongside unstructured data. The data that is gathered can then be used for reporting, visualization, business intelligence, machine learning, and advanced analytics. You need to become familiar with the names of some popular data lake solutions, such as AWS Lake Formation.
You also need to know which solution to propose and why. If you need to offer a platform that can provide historical trending reports, deep data analysis capabilities, and the ability to report on massive amounts of data, then a DWH is more suitable for your use case. Keep in mind that data that has been extracted from Salesforce is mostly in a structured format (files are an exception).
In many review board scenarios, you will come across the need to archive the platform data. You will discover the options that are available in Chapter 7, Designing a Scalable Salesforce Data Architecture, but at the moment, all you need to know is that one of these options is a DWH.
Data lakes could be the right solution if you need to store structured and unstructured data in one place to facilitate tasks such as machine learning or advanced analytics.
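The selection guidance above can be summarized as a small rule-of-thumb helper. This is only a sketch of the criteria mentioned in this section; the Requirements class, its field names, and the suggest_platforms function are hypothetical, and a real recommendation would weigh many more factors.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical checklist of needs gathered from a scenario."""
    historical_trend_reports: bool = False
    deep_analysis_on_massive_data: bool = False
    archive_platform_data: bool = False
    structured_and_unstructured_in_one_place: bool = False
    machine_learning_or_advanced_analytics: bool = False


def suggest_platforms(req: Requirements) -> dict:
    """Return each candidate platform together with the requirements that point to it."""
    reasons = {"data warehouse": [], "data lake": []}

    # Signals that this section associates with a DWH.
    if req.historical_trend_reports:
        reasons["data warehouse"].append("historical trending reports")
    if req.deep_analysis_on_massive_data:
        reasons["data warehouse"].append("deep analysis and reporting on massive amounts of data")
    if req.archive_platform_data:
        reasons["data warehouse"].append("archiving platform data")

    # Signals that this section associates with a data lake.
    if req.structured_and_unstructured_in_one_place:
        reasons["data lake"].append("structured and unstructured data stored in one place")
    if req.machine_learning_or_advanced_analytics:
        reasons["data lake"].append("machine learning or advanced analytics")

    # Only report platforms that at least one requirement points to.
    return {platform: why for platform, why in reasons.items() if why}


reporting_case = suggest_platforms(Requirements(historical_trend_reports=True))
ml_case = suggest_platforms(
    Requirements(
        structured_and_unstructured_in_one_place=True,
        machine_learning_or_advanced_analytics=True,
    )
)
print(reporting_case)
print(ml_case)
```

Returning the reasons alongside each platform mirrors the expectation that you can justify a proposal, not just name a product. When requirements point to both platforms, the helper lists both and leaves the final judgment to you.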
Another common requirement in today’s enterprise solutions is a document management system. You will learn more about these systems in the next section.