Building a Data Lake Using Dataproc
A data lake shares similarities with a data warehouse, yet its fundamental distinction lies in the nature of stored content. Unlike a data warehouse, a data lake is designed to manage extensive raw data, agnostic to its eventual value or purpose. This pivotal divergence reshapes approaches to data storage and retrieval within a data lake, setting it apart from the principles that we learned in Chapter 3, Building a Data Warehouse in BigQuery.
This chapter helps you understand how to build a data lake using Dataproc, which is a managed Hadoop cluster in Google Cloud Platform (GCP). But, more importantly, it helps you understand the key benefit of using a data lake in the cloud, which is allowing the use of ephemeral clusters.
Here is a high-level outline of this chapter:
- Introduction to Dataproc
- Exercise – Building a data lake on a Dataproc cluster
- Exercise – Creating and running jobs on a Dataproc cluster
- Understanding...