Summary
In this chapter, we introduced one of the most important services in the AWS stack – AWS Glue. We looked at the high-level components that make up AWS Glue, such as the AWS Glue console, the AWS Glue Data Catalog, AWS Glue crawlers, and AWS Glue code generators. We then learned how these components are connected and how they can be used together. Finally, we spent some time on recommended best practices for architecting and implementing AWS Glue.
As part of those best practices, we reviewed how to choose the right worker type when launching an AWS Glue job and how to optimize file sizes when splitting files. We saw what can cause YARN to run out of memory and what can be done to avoid this problem, and we learned how the Apache Spark UI can be leveraged for troubleshooting. We also defined data partitioning and predicate pushdown and explained why they are important, along with other best practices and techniques.
In the next chapter, we will learn...