Data lakes
A data lake can be defined as a centralized repository that allows you to store all structured and unstructured data at any scale. With today’s hyperscalers providing cheap, durable storage, it is now possible for organizations to store all of their data in the cloud without significant cost implications. Data lakes are broken down into layers, or zones.
In the first layer of the data lake, data is generally stored as-is. This lowers the barrier to entry and enables organizations to move all of their data into the “lake” without significantly increasing development or maintenance costs. Because the first layer is an as-is copy of the source data, organizations can use an automated, configuration-based pipeline to onboard new sources.
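As a minimal sketch of what such a configuration-based pipeline might look like, the snippet below copies each configured source into the raw zone as-is using boto3. The SOURCE_CONFIGS structure, bucket names, and prefixes are illustrative assumptions, not a specific product’s format.

```python
import boto3

# Each new source is just another config entry; no bespoke pipeline code.
# All names below are hypothetical placeholders.
SOURCE_CONFIGS = [
    {"source_name": "orders", "src_bucket": "erp-exports", "src_prefix": "orders/",
     "lake_bucket": "corp-data-lake", "lake_prefix": "raw/erp/orders/"},
    {"source_name": "customers", "src_bucket": "crm-exports", "src_prefix": "customers/",
     "lake_bucket": "corp-data-lake", "lake_prefix": "raw/crm/customers/"},
]

def ingest(config: dict) -> None:
    """Copy every object from the source location into the raw zone, unchanged."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=config["src_bucket"], Prefix=config["src_prefix"]):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            dest_key = config["lake_prefix"] + key[len(config["src_prefix"]):]
            s3.copy_object(
                Bucket=config["lake_bucket"],
                Key=dest_key,
                CopySource={"Bucket": config["src_bucket"], "Key": key},
            )

for cfg in SOURCE_CONFIGS:
    ingest(cfg)
```

Onboarding another source then amounts to adding one more entry to the configuration, which is what keeps development and maintenance costs flat as the lake grows.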
Organizations usually pick a replication tool such as AWS Database Migration Service (AWS DMS) to bring the data into the data lake. Although AWS DMS requires managing the replication infrastructure, it is otherwise a mostly hands-off mechanism for hydrating the lake. Organizations may also use a push mechanism such as SFTP, transferring files to an Amazon Simple Storage Service (Amazon S3)-based data lake through AWS Transfer Family.
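A hedged sketch of kicking off such a replication with boto3 follows. It assumes the source and target endpoints and the replication instance already exist; every ARN and identifier below is a placeholder.

```python
import json
import boto3

dms = boto3.client("dms")

# Replicate every table in a hypothetical "sales" schema.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-sales",
        "object-locator": {"schema-name": "sales", "table-name": "%"},
        "rule-action": "include",
    }]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="sales-to-lake",                 # placeholder name
    SourceEndpointArn="arn:aws:dms:region:acct:endpoint:SRC",  # placeholder ARN
    TargetEndpointArn="arn:aws:dms:region:acct:endpoint:TGT",  # placeholder ARN (S3 target)
    ReplicationInstanceArn="arn:aws:dms:region:acct:rep:RI",   # placeholder ARN
    MigrationType="full-load-and-cdc",  # initial bulk copy, then ongoing change capture
    TableMappings=json.dumps(table_mappings),
)
task_arn = task["ReplicationTask"]["ReplicationTaskArn"]

# The task must reach the "ready" state before it can be started.
dms.get_waiter("replication_task_ready").wait(
    Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
)
dms.start_replication_task(
    ReplicationTaskArn=task_arn,
    StartReplicationTaskType="start-replication",
)
```

The `full-load-and-cdc` migration type is what makes this largely hands-off: after the initial copy, ongoing changes flow into the lake without further intervention.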
During data preparation, data from the first layer is compressed and partitioned, and audit columns are added so that downstream systems can consume it more effectively. Having all the data in the data lake enables data analysts to do initial discovery and find out the value of combining data from various sources. If value is discovered, the necessary transformations are applied in an ETL pipeline so that the target is hydrated with new data periodically or through a streaming arrangement. The output of these automated transformations is then loaded into the final layer of the data lake for user consumption.
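A minimal PySpark sketch of this preparation step is shown below: read raw files as-is, add audit columns, then write compressed, partitioned Parquet into the next zone. The paths, the orders dataset, and its `order_ts` column are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-prepared").getOrCreate()

# Read the as-is copy from the first layer (hypothetical path and schema).
raw = spark.read.option("header", "true").csv("s3://corp-data-lake/raw/erp/orders/")

prepared = (
    raw
    # Audit columns: when the row was ingested and which file it came from.
    .withColumn("ingest_ts", F.current_timestamp())
    .withColumn("source_file", F.input_file_name())
    # A partition key derived from the data keeps downstream scans cheap.
    .withColumn("order_date", F.to_date("order_ts"))
)

(prepared.write
    .mode("overwrite")
    .partitionBy("order_date")
    .option("compression", "snappy")  # compact, splittable columnar output
    .parquet("s3://corp-data-lake/prepared/erp/orders/"))
```

Partitioning by a date column and writing snappy-compressed Parquet is a common choice here because downstream queries can prune whole partitions and read only the columns they need.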