You're reading from Data Engineering with Google Cloud Platform A practical guide to operationalizing scalable data analytics systems on GCP

Product type Paperback

Published in Mar 2022

Publisher Packt

ISBN-13 9781800561328

Length 440 pages

Edition 1st Edition

Languages

Python

Tools

Google Cloud Platform

Concepts

Data Analysis

Author (1):

Adi Wijaya



Foundational concepts for data engineering

We will learn many data engineering concepts throughout the book by using Google Cloud Platform (GCP), but some concepts are basic and you need to know them as a data engineer. In my experience interviewing at data companies, questions about these foundational concepts are often asked to test how much you know about data engineering. Take the following examples:

What is Extract-Transform-Load (ETL)?
What's the difference between ETL and Extract-Load-Transform (ELT)?
What is big data?
How do you handle large volumes of data?

These questions are very common, yet it is important to understand the concepts behind them deeply, since they may affect our decisions when architecting our data life cycles.

ETL concept in data engineering

ETL is the key foundation of data engineering. All things in the data life cycle are ETL; any part that happens from upstream to downstream is ETL. Let's take a look at an upstream-to-downstream flow that has an ETL process in between:

Figure 1.8 – ETL illustration

ETL consists of three actual steps that you need in order to move your data:

What is extract? This is the step to get the data from the upstream system. For example, if the upstream system is an RDBMS, then the extract step will be dumping or exporting data from the RDBMS.
What is transform? This is the step to apply any transformation to the extracted data. For example, if the file from the RDBMS needs to be joined with a static CSV file, then the transform step will process the extracted data, load the CSV file, and finally join the two datasets together in an intermediary system.
What is load? This is the step to put the transformed data into the downstream system. For example, if the downstream system is BigQuery, then the load step will run a BigQuery load job to store the data in a BigQuery table.
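
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the upstream RDBMS and the downstream system are both simulated with in-memory SQLite databases, and the table names, column names, and CSV content are hypothetical. A real pipeline would export from the actual database and load into a system such as BigQuery.

```python
import csv
import io
import sqlite3

# Hypothetical upstream RDBMS, simulated with an in-memory SQLite database.
upstream = sqlite3.connect(":memory:")
upstream.execute(
    "CREATE TABLE orders (order_id INTEGER, country_code TEXT, amount REAL)"
)
upstream.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "ID", 10.0), (2, "SG", 25.5), (3, "ID", 4.5)],
)

# Hypothetical static CSV file, kept in a string so the sketch is self-contained.
COUNTRIES_CSV = "country_code,country_name\nID,Indonesia\nSG,Singapore\n"


def extract(conn):
    """Extract: export all rows from the upstream table."""
    query = "SELECT order_id, country_code, amount FROM orders ORDER BY order_id"
    return conn.execute(query).fetchall()


def transform(rows, csv_text):
    """Transform: join the extracted rows with the static CSV lookup."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lookup = {row["country_code"]: row["country_name"] for row in reader}
    return [
        (order_id, lookup.get(code, "Unknown"), amount)
        for order_id, code, amount in rows
    ]


def load(rows, conn):
    """Load: write the transformed rows into the downstream table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders_enriched "
        "(order_id INTEGER, country_name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders_enriched VALUES (?, ?, ?)", rows)
    conn.commit()


# Simulated downstream system.
downstream = sqlite3.connect(":memory:")
load(transform(extract(upstream), COUNTRIES_CSV), downstream)

result = downstream.execute(
    "SELECT * FROM orders_enriched ORDER BY order_id"
).fetchall()
print(result)
```

Note that the join happens in the Python process, which plays the role of the intermediary system; the downstream database only receives the finished rows.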

Back in Figure 1.5, Data life cycle diagrams, each of the individual steps may have a different ETL process. For example, at the application database to data lake step, the application database is the upstream and the data lake is the downstream. But at the data lake to data warehouse step, the data lake becomes the upstream and the data warehouse becomes the downstream. So, you need to think about how you want to do the ETL process in every data life cycle step.

The difference between ETL and ELT

ETL is extract, transform, load and ELT is extract, load, transform. From the acronyms themselves, the only difference between ETL and ELT is the ordering of the letters T and L. The question is: should you transform first and then load the data into the downstream system, or load the data into the downstream system first and then transform it there?

Figure 1.9 – Extract load transform

Easy! What's the big deal?

Even though it's a very simple difference in the acronym, deciding on the method can really affect your choice of technology products, system performance, scalability, and cost. For example, not all downstream systems are powerful enough to transform large volumes of data; in this case, ETL is preferred since using the ELT pattern will introduce issues in your downstream system.

In other cases, the downstream system is a lot more powerful than any intermediary system, so you want to choose the ELT pattern. This has mostly been the case since the data lake era, where the downstream systems are products such as Hadoop, BigQuery, or other scalable data processing products. But this is not an absolute answer; depending on your available choice of technology, you may change your ETL versus ELT strategy.

You will understand this better after running through the content of this book with a lot of ETL and ELT examples, but at this point, the important thing to keep in mind is that, as a data engineer, you have two options for where to transform your data: in an intermediary system or in the target system.
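
To make the second option concrete, here is a minimal ELT sketch in Python. It assumes a target system that can run SQL, simulated here with an in-memory SQLite database; the table names and sample records are hypothetical. The raw data is loaded untouched, and the transformation runs inside the target system.

```python
import sqlite3

# Hypothetical raw records, as extracted from an upstream system.
raw_orders = [(1, "ID", 10.0), (2, "SG", 25.5), (3, "ID", 4.5)]

# The target (downstream) system, simulated with in-memory SQLite.
target = sqlite3.connect(":memory:")

# Load first: the raw data lands in the target system without any changes.
target.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, country_code TEXT, amount REAL)"
)
target.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform second: the target system's own SQL engine does the aggregation.
target.execute(
    """
    CREATE TABLE sales_by_country AS
    SELECT country_code, SUM(amount) AS total_amount
    FROM raw_orders
    GROUP BY country_code
    """
)

totals = target.execute(
    "SELECT country_code, total_amount FROM sales_by_country ORDER BY country_code"
).fetchall()
print(totals)
```

The Python process only moves data here. All of the transformation cost falls on the target system, which is why this pattern fits best when the target is the most scalable component available.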

What is NOT big data?

After ETL and ELT, the next most common terminology is big data. Since big data is still one of the concepts most closely associated with data engineering, how you interpret the term as a data engineer is important. Note that the term big data itself refers to two different subjects:

The data itself is big.
The big data technology.

With so much hype in the media about the term, both in the context of data getting bigger and of big data technology, I don't think I need to tell you the definition of big data. Instead, I will focus on eliminating the definitions of big data that are not relevant to data engineers. Here are some definitions from the media or from people that I have met personally:

All people already use social media, the data in social media is huge, and we can use the social media data for our organization. That's big data.
My company doesn't have data. Big data is a way to use data from the public internet to start my data journey. That's big data.
The five Vs of data: volume, variety, velocity, veracity, and value. That's big data.

All the preceding definitions are correct but not really helpful to us as data engineers. So instead of seeing big data as general use cases, we need to focus on the how questions; think about what actually works under the hood. Take the following examples:

How do you store 1 PB of data in storage, while the size of common hard drives is in TBs?
How do you average a list of numbers, when the data is stored in multiple computers?
How can you continuously extract data from the upstream system and do aggregation as a streaming process?

These kinds of questions are what matter to data engineers. Data engineers need to know when a condition (the data itself is big) should be handled using big data technology and when non-big data technology is enough.
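
Take the averaging question as an example. One common approach, sketched below in Python, is to let each machine compute a small (sum, count) pair for its own data and then combine the pairs in one place. The partitions and function names are hypothetical, and plain lists stand in for the separate machines.

```python
# Hypothetical partitions: each inner list lives on a different machine.
partitions = [
    [10, 20, 30, 40],  # machine 1
    [50],              # machine 2
    [60, 70],          # machine 3
]


def local_summary(numbers):
    """Runs on each machine: return a small (sum, count) pair."""
    return sum(numbers), len(numbers)


def combine(summaries):
    """Runs on one machine: merge the pairs into the global average."""
    total = sum(partial_sum for partial_sum, _ in summaries)
    count = sum(partial_count for _, partial_count in summaries)
    return total / count


summaries = [local_summary(part) for part in partitions]
distributed_average = combine(summaries)

# For comparison: averaging the per-machine averages gives a different,
# incorrect answer when the partitions have different sizes.
average_of_averages = sum(sum(p) / len(p) for p in partitions) / len(partitions)

print(distributed_average, average_of_averages)
```

Only the small pairs travel between machines, never the full list of numbers. This is the main reason for splitting the work this way.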

A quick look at how big data technologies store data

Since answering the how questions is the key to understanding big data, the first question we need to answer is: how does big data technology actually store the data? What makes it different from non-big data storage?

The word big in big data is relative. For example, say you analyze Twitter data and then download the data as JSON files with a size of 5 GB, and your laptop storage is 1 TB with 16 GB memory.

I don't think that's big data. But if the Twitter data is 5 PB, then it becomes big data because you need a special way to store it and a special way to process it. So, the key is not about whether it is social media data or not, or unstructured or not, which sometimes many people still get confused by. It's more about the size of the data relative to your system.
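
The arithmetic behind this comparison can be written as a short Python sketch. It assumes decimal storage units (1 TB = 10^12 bytes) and a hypothetical helper named machines_needed; it ignores replication and any other overhead.

```python
# Decimal units, as storage sizes are commonly quoted.
GB = 10**9
TB = 10**12
PB = 10**15

laptop_storage = 1 * TB


def machines_needed(data_size, storage_per_machine):
    """Minimum number of machines needed to hold the data (ceiling division)."""
    return -(-data_size // storage_per_machine)


small_case = machines_needed(5 * GB, laptop_storage)
large_case = machines_needed(5 * PB, laptop_storage)

print(small_case, large_case)
```

The 5 GB dataset fits on a single laptop, while the 5 PB dataset would need thousands of laptop-sized disks. That is the point at which distributed storage becomes necessary.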

Big data technology needs to be able to distribute the data across multiple servers. The common terminology for multiple servers working together is a cluster. I'll give an illustration to show you how a very large file can be distributed as multiple file parts on multiple machines:

Figure 1.10 – Distributed filesystem

In a distributed filesystem, a large file will be split into multiple small parts. In the preceding example, it is split into nine parts, and each part is a small 128 MB file. Then, the file parts are distributed randomly across three machines. On top of the file parts, there will be metadata to store information about how the file parts form the original file; for example, a large file is a combination of file part 1 located in machine 1, file part 2 located in machine 2, and so on.

The distributed parts can be stored in any format that isn't necessarily a file format; for example, it can be in the form of data blocks, byte arrays in memory, or some other data format. But for simplicity, what you need to be aware of is that in a big data system, data can be stored in multiple machines and in order to optimize performance, sometimes you need to think about how you want to distribute the parts.
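
The idea of splitting, distributing, and tracking file parts can be sketched in Python as follows. This is a toy model, not how any particular distributed filesystem is implemented: dictionaries stand in for machines, the chunk size is 128 bytes instead of 128 MB, and all function and machine names are hypothetical.

```python
import random

CHUNK_SIZE = 128  # bytes here; stands in for the 128 MB parts in the text


def split_file(data, chunk_size=CHUNK_SIZE):
    """Split the file content into fixed-size parts (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def distribute(parts, machine_names, seed=0):
    """Place each part on a randomly chosen machine and record where it went."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    machines = {name: {} for name in machine_names}
    metadata = []  # ordered list of (part_id, machine_name)
    for part_id, part in enumerate(parts):
        name = rng.choice(machine_names)
        machines[name][part_id] = part
        metadata.append((part_id, name))
    return machines, metadata


def reassemble(machines, metadata):
    """Use the metadata to fetch the parts in order and rebuild the file."""
    return b"".join(machines[name][part_id] for part_id, name in metadata)


large_file = bytes(range(256)) * 4 + b"x" * 128  # 1,152 bytes, so nine parts
parts = split_file(large_file)
machines, metadata = distribute(parts, ["machine-1", "machine-2", "machine-3"])
restored = reassemble(machines, metadata)

print(len(parts), restored == large_file)
```

Without the metadata, the parts on each machine would be meaningless fragments; the metadata is what lets the system present them as one file again.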

After we know data can be split into small parts on different machines, it leads to further questions:

How do I process the files?
What if I want to aggregate some numbers from the files?
How does each part know the record values in other parts when they are stored on different machines?

There are many approaches to answer these three questions. But one of the most famous concepts is MapReduce.

A quick look at how to process multiple files using MapReduce

Historically speaking, MapReduce is a framework that was published as a white paper by Google and is widely used in the Hadoop ecosystem. There is an actual open source project called MapReduce mainly written in Java that still has a large user base, but slowly people have started to change to other distributed processing engine alternatives, such as Spark, Tez, and Dataflow. But MapReduce as a concept itself is still relevant regardless of the technology.

In a short summary, the word MapReduce can refer to two definitions:

MapReduce as a technology
MapReduce as a concept

What is important for us to understand is MapReduce as a concept. MapReduce is a combination of two words: map and reduce.

Let's take a look at an example. Suppose you have a file that's divided into three file parts:

Figure 1.11 – File parts

Each of the parts contains one or more words, which in this example are fruits. The file parts are stored on different machines. So, across the machines, we have these three file parts:

File Part 1 contains two words: Banana and Apple.
File Part 2 contains three words: Melon, Apple, and Banana.
File Part 3 contains one word: Apple.

How can you write a program to calculate a word count that produces these results?

Apple = 3
Banana = 2
Melon = 1

Since the file parts are spread across different machines, we cannot just count the words directly. We need MapReduce. Let's take a look at the following diagram, where file parts are mapped, shuffled, and lastly reduced to get the final result:

Figure 1.12 – MapReduce step diagram

There are four main steps in the diagram:

Map: Add a static value of 1 to each individual record. This will transform each word into a key-value pair where the value is always 1.
Shuffle: At this point, we need to move the fruit words between machines. We want to group each word and store each group on the same machine.
Reduce: Because each fruit group is already on the same machine, we can count them together. The Reduce step will sum up the static value 1 to produce the count results.
Result: Store the final results back on a single machine.
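
The four steps can be imitated in plain Python to see the data flow. This sketch runs in a single process, so it only models the logic, not the actual distribution across machines; the function names are hypothetical, and each inner list stands in for a file part on its own machine.

```python
from collections import defaultdict

# The three file parts from the example, one per machine.
file_parts = [
    ["Banana", "Apple"],
    ["Melon", "Apple", "Banana"],
    ["Apple"],
]


def map_step(words):
    """Map: turn each word into a (word, 1) key-value pair."""
    return [(word, 1) for word in words]


def shuffle_step(mapped_parts):
    """Shuffle: bring all values for the same word into one group."""
    groups = defaultdict(list)
    for pairs in mapped_parts:
        for word, value in pairs:
            groups[word].append(value)
    return groups


def reduce_step(groups):
    """Reduce: sum the static 1s in each group to get the count."""
    return {word: sum(values) for word, values in groups.items()}


mapped = [map_step(part) for part in file_parts]  # would run once per machine
grouped = shuffle_step(mapped)                    # would move data between machines
word_counts = reduce_step(grouped)                # would run once per group

# Result: collect the final counts in one place.
print(sorted(word_counts.items()))
```

In a real cluster, the map and reduce functions stay this simple; the framework takes care of running them on many machines and moving the pairs during the shuffle.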

The key idea here is to run as much of the processing as possible in a distributed manner. Looking back at the diagram, you can imagine that each box in each step is a different machine.

Each step, Map, Shuffle, and Reduce, always maintains three parallel boxes. What does this mean? It means that the processes happen in parallel on three machines. This paradigm is different from running all the processes on a single machine. For example, we can simply download all the file parts into a pandas DataFrame in Python and do a count using the pandas DataFrame. In this case, the process will happen on one machine.
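
For contrast, here is what the single-machine approach looks like. To keep the sketch free of external dependencies, it uses collections.Counter from the Python standard library as a stand-in for the pandas DataFrame mentioned above; with pandas installed, pd.Series(all_words).value_counts() would give the same counts.

```python
from collections import Counter

# The same three file parts, first downloaded onto one machine.
file_parts = [
    ["Banana", "Apple"],
    ["Melon", "Apple", "Banana"],
    ["Apple"],
]

# Everything is merged into one in-memory list, then counted in one pass.
all_words = [word for part in file_parts for word in part]
single_machine_counts = Counter(all_words)

print(single_machine_counts.most_common())
```

This is simpler than the MapReduce version, but it only works while all of the words fit in the memory of that one machine.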

MapReduce is a complex concept. The concept is explained in a 13-page-long document by Google. You can find the document easily on the public internet. In this book, I haven't added much deeper explanation about MapReduce. In most cases, you don't need to really think about it; for example, if in a later chapter you use BigQuery to process 1 PB of data, you will only need to run a SQL query and BigQuery will process it in a distributed manner in the background.

As a matter of fact, all technologies in GCP that we will use in this book are highly scalable and without question able to handle big data out of the box. But understanding the underlying concepts helps you as a data engineer in many ways, for example, choosing the right technologies, designing data pipeline architecture, troubleshooting, and improving performance.