Data Engineering with Google Cloud Platform: A guide to leveling up as a data engineer by building a scalable data platform with Google Cloud

Product type: Paperback
Published: Apr 2024
Publisher: Packt
ISBN-13: 9781835080115
Length: 476 pages
Edition: 2nd Edition
Languages: SQL
Tools: Google Cloud Platform
Concepts: Data Engineering
Author: Adi Wijaya


Building a Data Lake Using Dataproc

A data lake resembles a data warehouse in many ways, but it differs fundamentally in what it stores. Unlike a data warehouse, a data lake is designed to hold large volumes of raw data, regardless of its eventual value or purpose. This difference changes how data is stored and accessed in a data lake, and sets it apart from the principles that we learned in Chapter 3, Building a Data Warehouse in BigQuery.

This chapter shows you how to build a data lake using Dataproc, the managed Hadoop and Spark service in Google Cloud Platform (GCP). More importantly, it explains the key benefit of running a data lake in the cloud: the ability to use ephemeral clusters, which exist only for as long as a job needs them.
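To make the idea of an ephemeral cluster concrete before we go into the details, here is a minimal sketch of its lifecycle: create a cluster, run a job, and always delete the cluster afterward. It only builds the `gcloud dataproc` command lines and hands them to a runner function, which by default prints them instead of executing anything. The helper names (`ephemeral_cluster`, `submit_pyspark`, `print_runner`), the cluster name, the region, and the bucket path are all hypothetical and chosen for illustration; this is not the setup used later in the chapter.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List

Command = List[str]


def print_runner(cmd: Command) -> None:
    """Dry-run runner: print the command instead of executing it."""
    print(" ".join(cmd))


@contextmanager
def ephemeral_cluster(
    name: str,
    region: str,
    run: Callable[[Command], None] = print_runner,
    num_workers: int = 2,
) -> Iterator[str]:
    """Create a cluster, yield its name, and delete it when the block exits."""
    run(["gcloud", "dataproc", "clusters", "create", name,
         f"--region={region}", f"--num-workers={num_workers}"])
    try:
        yield name
    finally:
        # Delete even if the job failed, so no idle cluster keeps running.
        run(["gcloud", "dataproc", "clusters", "delete", name,
             f"--region={region}", "--quiet"])


def submit_pyspark(
    cluster: str,
    region: str,
    script_uri: str,
    run: Callable[[Command], None] = print_runner,
) -> None:
    """Submit a PySpark script (assumed to live in Cloud Storage) to the cluster."""
    run(["gcloud", "dataproc", "jobs", "submit", "pyspark", script_uri,
         f"--cluster={cluster}", f"--region={region}"])


if __name__ == "__main__":
    # Hypothetical names; replace with your own cluster, region, and script.
    with ephemeral_cluster("demo-ephemeral", "us-central1") as cluster:
        submit_pyspark(cluster, "us-central1", "gs://example-bucket/jobs/etl.py")
```

The `try`/`finally` block is the important design choice here: the cluster is removed whether the job succeeds or fails. For this pattern to work, the data and the job script must live outside the cluster (for example, in Cloud Storage), because everything stored on the cluster disappears when it is deleted.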

Here is a high-level outline of this chapter: