Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Data Engineering with Google Cloud Platform

You're reading from   Data Engineering with Google Cloud Platform A practical guide to operationalizing scalable data analytics systems on GCP

Arrow left icon
Product type Paperback
Published in Mar 2022
Publisher Packt
ISBN-13 9781800561328
Length 440 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Adi Wijaya Adi Wijaya
Author Profile Icon Adi Wijaya
Adi Wijaya
Arrow right icon
View More author details
Toc

Table of Contents (17) Chapters Close

Preface 1. Section 1: Getting Started with Data Engineering with GCP
2. Chapter 1: Fundamentals of Data Engineering FREE CHAPTER 3. Chapter 2: Big Data Capabilities on GCP 4. Section 2: Building Solutions with GCP Components
5. Chapter 3: Building a Data Warehouse in BigQuery 6. Chapter 4: Building Orchestration for Batch Data Loading Using Cloud Composer 7. Chapter 5: Building a Data Lake Using Dataproc 8. Chapter 6: Processing Streaming Data with Pub/Sub and Dataflow 9. Chapter 7: Visualizing Data for Making Data-Driven Decisions with Data Studio 10. Chapter 8: Building Machine Learning Solutions on Google Cloud Platform 11. Section 3: Key Strategies for Architecting Top-Notch Data Pipelines
12. Chapter 9: User and Project Management in GCP 13. Chapter 10: Cost Strategy in GCP 14. Chapter 11: CI/CD on Google Cloud Platform for Data Engineers 15. Chapter 12: Boosting Your Confidence as a Data Engineer 16. Other Books You May Enjoy

Knowing the roles of a data engineer before starting

In the later chapters, we will spend much of our time doing practical exercises to understand the data engineering concepts. But before that, let's quickly take a look at the data engineer role. 

The job role is getting more and more popular now, but the terminology itself is relatively new compared to other job roles, such as accountant, lawyer, doctor, and many other well-established job roles. The impact is that sometimes there is still a debate of what a data engineer should and shouldn't do. 

For example, if you came to a hospital and met a doctor, you know for sure that the doctor would do the following:

  1. Examine your condition.
  2. Make a diagnosis of your health issues.
  3. Prescribe medicine. 

The doctor wouldn't do the following:

  1. Clean the hospital.
  2. Make the medicine.
  3. Manage hospital administration.

It's clear, and it applies to most well-established job roles. But how about data engineers?

This is just a very short list of examples of what data engineers should or shouldn't be responsible for:

  • Handle all big data infrastructures and software installation.
  • Handle application databases.
  • Design the data warehouse data model.
  • Analyze big data to transform raw data into meaningful information.
  • Create a data pipeline for machine learning.

The unclear condition is unavoidable since it's a new role and I believe it will be more and more established following the maturity of data science. In this section, let's try to understand what a data engineer is and despite many combinations of responsibilities, what you should focus on as a data engineer.

Data engineer versus data scientist

A data engineer is someone who designs and builds data pipelines. 

The definition is that simple, but I found out that the question about the different between a data engineer versus a data scientist is still one of the most frequently asked questions when someone wants to start their data career. The hype of data scientists on the internet is one of the drivers; for example, up until today people still like to quote the following:

"Data scientist: the sexiest job of the 21st Century"

– Harvard Business Review

The data scientist role was originally invented to refer to groups of people who are highly curious and able to utilize big data technologies for business purposes back in 2008. But since the technologies are maturing and becoming more complex, people start to realize that it's too much. It's very rare for a company to hire someone who knows how to do all of the following: 

  • How to handle big data infrastructure
  • Properly design and build ETL pipelines
  • Train machine learning models 
  • Understand deeply about the company's business 

Not that it's impossible, some people do have this knowledge, but from a company's point of view, it's not practical.

These days, for better focus and scalability, the data scientist role can be split into many different roles, for example, data analyst, machine learning engineer, and business analyst. But one of the most popular and realized to be very important roles is data engineer.

The focus of data engineers

Let's map the data engineer role to our data life cycle diagram Figure 1.5 from the previous section. 

In the diagram, I added two underlying components:

  • Job Orchestrator: Design and build a job dependency and scheduler that runs data movement from upstream to downstream.
  • Infrastructure: Provision the required data infrastructure to run the data pipelines.

And on each step, I added numbers from 1 to 3. The numbers will help you to identify which components are the data engineer's main responsibility. This diagram works together with Figure 1.7, a data engineer-focused diagram to map the numbering. First, let's check this data life cycle diagram that we discussed before with the numbering on it:

Figure 1.6 – Data life cycle flows with focus numbering

Figure 1.6 – Data life cycle flows with focus numbering

After seeing the numbering on the data life cycle, check this diagram that illustrates the focus points of a data engineer:

Figure 1.7 – Data engineer-focused diagram

Figure 1.7 – Data engineer-focused diagram

The diagram shows the distribution of the knowledge area from the end-to-end data life cycle. At the center of the diagram (number 3) are the jobs that are the key focus of data engineers, and I will call it the core.  

Those numbered 2 are the good to have area. For example, it's still common in small organizations that data engineers need to build a data mart for business users. 

Important Note

Designing and building a data mart is not as simple as creating tables in a database. Someone who builds a data mart needs to be able to talk to business people and gather requirements to serve tables from a business perspective, which is one of the reasons it's not part of the core.

While how to collect data to a data lake is part of the data engineer's responsibility, exporting data from operational application databases is often done by the application development team, for example, dumping MySQL tables as CSV in staging storage.

Those numbered 1 are the good to know area. For example, it's rare that a data engineer needs to be responsible for building application databases, developing machine learning models, maintaining infrastructure, and creating dashboards. It is possible, but less likely. The discipline needs knowledge that is a little bit too far from the core.

After learning about the three focus areas, now let's retrospect our understanding and vision about data engineers. Study the diagram carefully and answer these questions.

  • What are your current focus areas as an individual?
  • What are your current job's role focus areas (or if you are a student, your study areas)?
  • What is your future goal in the data science world?

Depending on your individual answers, check with the diagram – do you have all the necessary skills at the core? Does your current job give you experience in the core? Are you excited if you could master all subjects at the core in the near future?

From my experience, what is important to data engineers is the core. Even though there are a variety of data engineers' expectations, responsibilities, and job descriptions in the market, if you are new to the role, then the most important thing is to understand what the core of a data engineer is. 

The diagram gives you guidance on what type of data engineers you are or will be. The closer you are to the core, the more of a data engineer you are. You are on the right track and in the right environment to be a good data engineer. 

In scenarios where you are at the core, plus other areas beside it, then you are closer to a full-stack data expert; as long as you have a strong core, if you are able to expand your expertise to the good to have and good to know areas, you will have a good advantage in your data engineering career. But if you focus on other non-core areas, I suggest you find a way to master the core first. 

In this section, we learned about the role of a data engineer. If you are not familiar with the cores, the next section will be your guidance to the fundamental concepts in data engineering. 

You have been reading a chapter from
Data Engineering with Google Cloud Platform
Published in: Mar 2022
Publisher: Packt
ISBN-13: 9781800561328
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime