Reducing time on read operations using AWS Glue grouping
Let’s assume you have an edge use case: one of your dimension table data sources in Amazon S3 contains over 1 billion rows spread across millions of small files, and you have written ETL code in a Glue job that reads those files with standard Glue workers, performs a file format conversion, and writes the results back to S3. In this section, you will learn how to deal with expensive Spark read operations, especially when reading data from large dimension tables with AWS Glue.
As we know, Glue provisions and manages the resources required to perform ETL for you. That said, when the Spark driver throws out-of-memory (OOM) exceptions, you need to understand how Spark works to resolve them. After a Glue job runs, the Glue console provides ETL metrics and memory profiles for each job run, which help you identify job abnormalities and performance issues.
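One common fix for this small-files pattern is Glue's file grouping feature: by passing the `groupFiles` and `groupSize` connection options to the S3 reader, Glue coalesces many small files into larger in-memory groups, so Spark creates far fewer read tasks and the driver tracks far less file metadata. The sketch below shows the shape of those options; the S3 path and the group size value are illustrative assumptions, not values from this use case.

```python
# Sketch: enabling AWS Glue file grouping for an S3 read.
# The path and group size below are illustrative assumptions.

def grouped_read_options(s3_path, group_size_bytes="1048576"):
    """Build connection_options that tell Glue to coalesce small
    files into larger groups before Spark plans read tasks."""
    return {
        "paths": [s3_path],
        "recurse": True,
        # "inPartition" groups files within each S3 partition;
        # groupSize (bytes, passed as a string) caps each group.
        "groupFiles": "inPartition",
        "groupSize": group_size_bytes,
    }

# Inside a Glue job script, the options would be passed to the
# DynamicFrame reader, for example:
#
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="s3",
#     connection_options=grouped_read_options(
#         "s3://my-bucket/dim_table/"   # hypothetical path
#     ),
#     format="json",
# )
```

Note that Glue enables grouping automatically when an input contains more than roughly 50,000 files; setting `groupSize` explicitly lets you tune the trade-off between task count and per-task memory yourself.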