Data Engineering with AWS: Learn how to design and build cloud-based data transformation pipelines using AWS

Product type Paperback

Published in Dec 2021

Publisher Packt

ISBN-13 9781800560413

Length 482 pages

Edition 1st Edition

Languages Python

Tools AWS

Concepts Data Analysis

Author (1): Gareth Eagar

Data engineers – the big data enablers

Amid the increasing recognition of data as a valuable corporate asset and the introduction of new technologies to store and process vast amounts of data, there has been an increase in the opportunities and roles available for data-related careers.

Let's look at a sample use case, where a sales manager for a consumer goods organization wants to better understand which alternative products a customer considers before purchasing their product. They also want a better way to predict product demand by category based on external factors, such as the expected weather.

Achieving the desired outcomes as specified by the sales manager will require bringing in data from multiple internal and external sources. Datasets that could be relevant to this scenario may include the following:

Customer, product, and order relational databases
Web server logs from the consumer-facing storefront
Third-party sales data from online marketplaces where relevant products are sold (such as Amazon.com)
Other relevant third-party datasets that may influence sales (for example, weather-related data)

Multiple teams would need to be involved in the project, with three roles playing a primary part in implementing the required solution: the data engineer, the data scientist, and the data analyst.

Understanding the role of the data engineer

The role of a data engineer is to do the following:

Design, implement, and maintain the pipelines that enable the ingestion of raw data into a storage platform.
Transform that data to be optimized for analytics.
Make that data available for various data consumers using their tool of choice.

In our scenario, the data engineer will first need to design the pipelines that ingest data from the various internal and external sources. To achieve this, they will use a variety of tools (more on that in future chapters), depending on the source system and whether it will be scheduled batch ingestion or real-time streaming ingestion.

The data engineer is also responsible for transforming the raw input datasets to optimize them for analytics, using various techniques (as discussed later in this book). The data engineer must also create processes to verify the quality of data, add metadata about the data to a data catalog, and manage the life cycle of code related to data transformation.
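The ingest, transform, and quality-check responsibilities described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the approach used later in this book: the field names (`order_id`, `category`, `amount`), the function names, and the sample rows are all hypothetical, and a production pipeline would more likely use a tool such as Apache Spark.

```python
import csv
import io

# Hypothetical raw input: in practice this would come from a database export,
# a log file, or a streaming source rather than an inline string.
RAW_ORDERS_CSV = """order_id,category,amount
1001,umbrellas,19.99
1002,sunscreen,7.50
1003,umbrellas,
1004,sunscreen,12.00
"""


def ingest(raw_text):
    """Ingest: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def is_valid(row):
    """Quality check: every field must be present and amount must be numeric."""
    if not all(row.get(field) for field in ("order_id", "category", "amount")):
        return False
    try:
        float(row["amount"])
    except ValueError:
        return False
    return True


def transform(rows):
    """Transform: keep valid rows and total the sales amount per category."""
    totals = {}
    rejected = []
    for row in rows:
        if not is_valid(row):
            rejected.append(row)
            continue
        totals[row["category"]] = totals.get(row["category"], 0.0) + float(row["amount"])
    return totals, rejected


totals, rejected = transform(ingest(RAW_ORDERS_CSV))
print(totals)
print(len(rejected), "row(s) failed the quality check")
```

Rows that fail the check are set aside rather than silently dropped, so that they can be inspected or reported on separately.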

Finally, the data engineer may need to assist in integrating various data consumption tools with the transformed data, enabling data analysts and data scientists to use their preferred tools to draw insights from the data.

The data engineer uses open source tools such as Apache Spark, Apache Kafka, and Presto, as well as commercially available products, to build the data pipeline and optimize data for analytics.

The data engineer is much like a civil engineer for a new residential development. The civil engineer is responsible for designing and building the roads, bridges, train stations, and so on that let commuters travel easily in and out of the development, while the data engineer is responsible for designing and building the infrastructure required to bring data into a central source and for optimizing the data for use by various data consumers.

Understanding the role of the data scientist

The role of a data scientist is to draw complex insights and make predictions based on various datasets, using machine learning and artificial intelligence. The data scientist will combine a number of skills, including computer science, statistics, analytics, and math, in order to help an organization answer complex questions and make informed decisions using data.

Data scientists need to understand the raw data and know how to use that data to develop and train complex machine learning models that will help recognize patterns in the data and predict future trends. In our scenario, the data scientist may build a machine learning model that uses past sales data, correlated with weather information for each day in the reporting period. They can then design and train this model to help business users get predictions on the likely top-selling categories for future dates based on the expected weather forecast.
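To make the scenario above more concrete, the following sketch fits a separate straight line of units sold against temperature for each product category, and then picks the category with the highest predicted sales for a forecast temperature. It is a deliberately simple stand-in for a real machine learning model: the categories, the historical figures, and the function names are all hypothetical, and ordinary least squares is used here only because it needs no external libraries.

```python
# Hypothetical history: (daily high in degrees Celsius, units sold) per category.
HISTORY = {
    "umbrellas": [(10, 40), (15, 30), (20, 20), (25, 10)],
    "sunscreen": [(10, 5), (15, 15), (20, 25), (25, 35)],
}


def fit_line(points):
    """Fit units = slope * temperature + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def predict_top_category(models, forecast_temperature):
    """Return the category with the highest predicted sales for a forecast."""
    predictions = {
        category: slope * forecast_temperature + intercept
        for category, (slope, intercept) in models.items()
    }
    return max(predictions, key=predictions.get), predictions


# "Train" one model per category, then query it with a forecast temperature.
models = {category: fit_line(points) for category, points in HISTORY.items()}
top_category, predictions = predict_top_category(models, forecast_temperature=30)
print(top_category, predictions)
```

With this made-up data, a warm forecast favors sunscreen and a cold one favors umbrellas. A real model would draw on far more history and many more factors than temperature alone.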

Where the data engineer is like a civil engineer building infrastructure for a new development, the data scientist is developing cars, airplanes, and other forms of transport used to move in and out of the development. Data scientists create machine learning models that enable data consumers and business analysts to draw new insights and predictions from data.

Understanding the role of the data analyst

The role of a data analyst is to examine and combine multiple datasets in order to help a business understand trends in the data and to make more informed business decisions. While a data scientist develops models that make future predictions or identify non-obvious patterns in data, the data analyst works with well-structured and modeled data to understand current conditions and to highlight recent patterns from the data.

A data analyst may answer questions such as which menu item sold best in different geographic regions over the past month, or which medical procedure had the best outcome for patients of different ages. These insights help an organization make better decisions for the future.

In our scenario, the data analyst may run complex queries against the different datasets that are available (such as an orders database or web server logs), joining together subsets of data from each source to gain new insights. For example, the data analyst may create a report highlighting which alternative products are most often browsed by a customer before a specific product is purchased. The data analyst may also make use of advanced machine learning models developed by the data scientists to gain further valuable insights.
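The kind of report described above could be produced with a query along the following lines. This is a sketch only, using an in-memory SQLite database from Python's standard library so that it is self-contained. The table names (`page_views`, `orders`), their columns, and the sample rows are assumptions made for illustration, with `page_views` standing in for data already extracted from the web server logs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_views (customer_id INTEGER, product TEXT, viewed_at TEXT);
CREATE TABLE orders (customer_id INTEGER, product TEXT, ordered_at TEXT);

INSERT INTO page_views VALUES
    (1, 'Brand B umbrella', '2021-12-01 09:00'),
    (1, 'Brand C umbrella', '2021-12-01 09:05'),
    (2, 'Brand B umbrella', '2021-12-02 14:00'),
    (3, 'Brand C umbrella', '2021-12-03 18:00');
INSERT INTO orders VALUES
    (1, 'Brand A umbrella', '2021-12-01 09:10'),
    (2, 'Brand A umbrella', '2021-12-02 14:30'),
    (3, 'Brand A umbrella', '2021-12-03 17:00');
""")

# For one purchased product, count how many distinct customers viewed each
# other product before placing their order. Timestamps are stored as
# 'YYYY-MM-DD HH:MM' text, which compares correctly as plain strings.
QUERY = """
SELECT v.product AS alternative, COUNT(DISTINCT v.customer_id) AS shoppers
FROM orders AS o
JOIN page_views AS v
  ON v.customer_id = o.customer_id
 AND v.viewed_at < o.ordered_at
 AND v.product <> o.product
WHERE o.product = ?
GROUP BY v.product
ORDER BY shoppers DESC, alternative
"""

report = conn.execute(QUERY, ("Brand A umbrella",)).fetchall()
for alternative, shoppers in report:
    print(alternative, shoppers)
conn.close()
```

The page view by customer 3 is excluded because it happened after that customer's order, so it cannot have been an alternative considered before the purchase.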

Where the data engineer is like a civil engineer building infrastructure, and the data scientist is developing means of transportation, the data analyst is like a skilled pilot, using their expertise to get users to their end destination.

Understanding other common data-related roles

Organizations may have other role titles and job descriptions for data-related positions, but generally, these will be a subset of the roles described in the preceding sections.

For example, a big data architect could be a subset of the data engineer role, focused on designing the architecture for big data pipelines, but not building the actual pipelines. Or, a data visualization developer may be focused on building out visualizations using business intelligence tools, but this is effectively a subset of the data analyst role.

Larger organizations tend to have more focused job roles, while in a smaller organization a single person may take on the role of data engineer, data scientist, and data analyst.

In this book, we will focus on the role of the data engineer, and dive deep into how a data engineer is able to build complex data pipelines using the power of cloud computing services. Let's now look at how cloud computing has simplified the way organizations build and scale out big data processing solutions.