Chapter 15: Architecting Data Lakes for Real-World Scenarios and Edge Cases
We are now well versed in the concept of a data lake: a centralized repository that allows you to store all your structured and unstructured data at any scale. Because a data lake focuses primarily on storage, it does not require as much processing power as other approaches (such as the data warehouse), which makes it easier, faster, and more cost-effective to scale as data volumes grow.
The data lake is not just a repository, however. It requires a well-designed data architecture, along with proper planning and management. Because its design is driven by the data rather than by predefined requirements, it lets you rapidly ingest raw data before any business requirements come into the picture. A variety of tools can be used to ingest raw data into a data lake, including ETL tools such as Ab Initio, Informatica, and DataStage.
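To make the idea of ingesting raw data ahead of any business requirements concrete, here is a minimal sketch in Python. It is not how Ab Initio, Informatica, or DataStage work internally; it only illustrates the principle of landing a file unchanged in a raw zone. The function name `ingest_raw`, the `raw/<source>/ingest_date=...` folder layout, and the sidecar `.meta.json` file are all assumptions made for this illustration.

```python
import hashlib
import json
import shutil
from datetime import date
from pathlib import Path


def ingest_raw(source_file: Path, lake_root: Path, source_name: str,
               ingest_date: date) -> Path:
    """Copy a file byte-for-byte into a date-partitioned raw zone."""
    # Hypothetical layout: <lake_root>/raw/<source_name>/ingest_date=YYYY-MM-DD/
    partition = f"ingest_date={ingest_date.isoformat()}"
    target_dir = lake_root / "raw" / source_name / partition
    target_dir.mkdir(parents=True, exist_ok=True)

    # No parsing and no schema: the file lands exactly as it arrived.
    target = target_dir / source_file.name
    shutil.copyfile(source_file, target)

    # Sidecar metadata (an assumption of this sketch) so that later
    # consumers can trace the file and verify its integrity.
    metadata = {
        "source": source_name,
        "original_name": source_file.name,
        "size_bytes": target.stat().st_size,
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
    }
    meta_path = target_dir / (source_file.name + ".meta.json")
    meta_path.write_text(json.dumps(metadata, indent=2))
    return target
```

Calling `ingest_raw(Path("orders.csv"), Path("/data/lake"), "sales", date.today())` would place the file under a `sales` folder partitioned by ingestion date. Structure and meaning are applied only later, when a consumer reads the data, which is what keeps ingestion fast.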
This chapter mainly covers practical examples of real-world data problems that exhibit certain bottlenecks and...