Solving join problems involving big fact and big dimension tables using AWS Glue
Whether you are a data engineer, big data architect, or business analyst, at some point you will need to scale your data processing and ETL batch workloads. In this section, we will look at one of AWS Glue’s Spark runtime optimization features: workload partitioning with bounded execution. This feature can help you handle join operations between a large fact table and a dimension table. We will also provide a hands-on tutorial to demonstrate the difference it can make to performance. Workload partitioning works in conjunction with AWS Glue job bookmarks, which we discussed in Chapter 2, Introduction to Important AWS Glue Features. It helps you break down large, complex workloads by bounding the execution of the respective Spark applications. In layman’s terms, you can partition your ETL workloads by putting a restriction in place for each of these independent workloads to...