What is Amazon Athena?
Amazon Athena is a query service that allows you to run standard SQL over data stored in a variety of sources and formats. As you will see later in this chapter, Athena is serverless, so there is no infrastructure to set up or manage. You simply pay $5 per TB scanned for the queries you run without needing to worry about idle resources or scaling.
Note
AWS has a habit of reducing prices over time. For the latest Athena pricing, please consult the Amazon Athena product page at https://aws.amazon.com/athena/pricing/?nc=sn&loc=3.
Athena is based on Presto (https://prestodb.io/), a distributed SQL engine that's open sourced by Facebook. It supports ANSI SQL, as well as Presto SQL features ranging from geospatial functions to rough query extensions, which allow you to run approximating queries, with statistically bound errors, over large datasets in only a fraction of the time. Athena's commitment to open source also provides an interesting avenue to avoid lock-in concerns because you always have the option to download and manage your own Presto deployment from GitHub. Of course, you will lose many of Athena's enhancements and must manage the infrastructure yourself, but you can take comfort in knowing you are not beholden to potentially punitive licensing agreements as you might be with other vendors.
While Athena's roots are open source, the team at AWS have added several enterprise features to the service, including the following:
- Federated Identity via SAML and Active Directory support
- Table, column, and even row-level access control via Lake Formation
- Workload classification and grouping for cost control via WorkGroups
- Automated regression testing to take the pain out of upgrades
Later chapters will cover these topics in greater detail. If you feel compelled to do so, you can use the table of contents to skip directly to those chapters and learn more.
Let's look at some use cases for Athena.
Use cases
Amazon Athena supports a wide range of use cases and we have personally used it for several different patterns. Thanks to Athena's ease of use, it is extremely common to leverage Athena for ad hoc analysis and data exploration.
Later in this book, you will use Athena from within a Jupyter notebook for machine learning. Similarly, many analysts enjoy using Athena directly from BI Tools such as Looker and Tableau, courtesy of Athena's JDBC driver. Athena's robust SQL dialect and asynchronous API model also enables application developers to build analytics right into their applications, enabling features that would not previously have been practical due to scale or operational burden. In many cases, you can replace RDBMS-driven features with Athena at a fraction of the cost and lower operational burden.
Another emerging use case for Athena is in the ETL space. While Athena advertises itself as being an engine that avoids the need for ETL by being able to query the data in place, as it is, we have seen the benefits of replacing existing or building new ETL pipelines using Athena where cost and capacity management are key factors. Athena will not necessarily achieve the same scale or performance as Spark, for example, but if your ETL jobs do not require multi-TB joins, you might find Athena to be an interesting option.
Separation of storage and compute
If you are new to serverless analytics, you may be wondering where your data is stored. Amazon Athena builds on the concept of Separation of Storage and Compute to decouple the computational resources (for example, CPU, memory, network) that do the heavy lifting of executing your SQL queries from the responsibility of keeping your data safe and available. In short, this means Athena itself does not store your data. Instead, you are free to choose from several data stores with customers increasingly pairing with DynamoDB to rapidly mutate data with S3 for their bulk data. With Athena, you can easily write a query that spans both data stores.
Amazon's Simple Storage Service, or S3 for short, is easily the most recommended data store to use with Athena. When Athena launched in 2016, S3 was the first data store it supported. Unsurprisingly, Athena has been optimized to take advantage of S3's unique ability to deliver exabyte scale and throughput while still providing eleven nines (99.999999999%) of durability. In addition to effortless scaling from a few gigabytes of data up to many petabytes, S3 offers some of the lowest prices for performance that you can find. Depending on your replication requirements, storing 1 GB of data for a month will cost you between $0.01 and $0.023. Even the most cost-efficient enterprise hard drives cost around $0.21 per GB before you add on redundancy, the power to run them, or a server and data center to house them. As with most AWS services, you should consult S3's pricing page (https://aws.amazon.com/s3/pricing/) for the latest details since AWS has cut their prices more than 70 times in the last decade.
Metastore
In addition to accessing the raw 1s and 0s that represent your data, Athena also requires metadata that helps its SQL engine understand how to interpret the data you have stored in S3 or elsewhere. This supplemental information helps Athena map collections of files, or objects in the case of S3, to SQL constructs such as tables, columns, and rows. The repository for this data, about your data, is often called a metastore. Athena works with Hive-compliant metastores, including AWS's Glue Data Catalog service. In later chapters, we will look at AWS Glue Data Catalog in more detail, as well as how you can attach Athena to your own metastore, even a homegrown one. For now, all you need to know is that Athena requires the use of a metastore to discover key attributes of the data you wish to query. The most common pieces of information that are kept in the Metastore include the following:
- A list of tables that exist
- The storage location of each table (for example, the S3 path or DynamoDB table name)
- The format of the files or objects that comprise the table (for example, CSV, Parquet, JSON)
- The column names and data types in each table (for example, inventory column is an integer, while revenue is a decimal (10,2))
Now that we have a good overview of Amazon Athena, let's look at how to use it in practice.