Tips and tricks to optimize Amazon Athena queries
When raw data is ingested into the data lake, we can immediately create a table for that data in the AWS Glue Data Catalog, either by using an AWS Glue crawler or by running DDL statements in Athena to define the table. Once the table has been created, we can start exploring it by running SQL queries against the data with Amazon Athena.
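As a rough illustration of the DDL route, the following Python sketch builds a `CREATE EXTERNAL TABLE` statement for comma-delimited files and, optionally, submits it to Athena through boto3. The database, table, column names, and S3 paths are hypothetical placeholders, and the helper functions `build_csv_table_ddl` and `submit_ddl` are illustrative names of our own, not part of any AWS library. The sketch also assumes the files carry a single header row.

```python
from typing import Dict


def build_csv_table_ddl(database: str, table: str, columns: Dict[str, str],
                        s3_location: str, skip_header: bool = True) -> str:
    """Return a CREATE EXTERNAL TABLE statement for comma-delimited text files."""
    # One "`name` type" entry per column, in the order given by the caller.
    column_lines = ",\n".join(
        f"  `{name}` {dtype}" for name, dtype in columns.items()
    )
    ddl = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'"
    )
    if skip_header:
        # Assumption: each file starts with exactly one header row.
        ddl += "\nTBLPROPERTIES ('skip.header.line.count'='1')"
    return ddl


def submit_ddl(ddl: str, database: str, output_location: str) -> str:
    """Submit the statement to Athena and return the query execution ID.

    Requires boto3 and valid AWS credentials; imported lazily so the
    DDL builder above can be used without them.
    """
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # Hypothetical database, table, columns, and bucket, for illustration only.
    statement = build_csv_table_ddl(
        database="raw_zone",
        table="sales",
        columns={"order_id": "string", "amount": "double"},
        s3_location="s3://example-bucket/raw/sales/",
    )
    print(statement)
```

Running the script only prints the generated statement; calling `submit_ddl` is a separate, explicit step because it needs AWS credentials and an S3 location for query results.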
However, raw data is often ingested in plaintext formats such as CSV or JSON. While we can query the data in these formats for ad hoc data exploration, they are not efficient when we need to run complex queries against large datasets. There are also ways to optimize the SQL queries that we write so that they make the best use of the underlying Athena query engine, which we will review in this chapter.
By default, Amazon Athena’s cost is based on the amount of data that is scanned to resolve your SQL query (measured at its compressed size when the data is compressed), so anything that can be done to reduce the...