Athena supports both Data Definition Language (DDL) and Data Manipulation Language (DML) queries. Queries where you SELECT
data from a table are a common example of DML queries. Our first meaningful Athena query will be a DDL query that creates, or defines, our NYC Taxis data table:
- Let's begin by ensuring our AWS account and IAM user/role are ready to use Athena. To do that, navigate to the Athena query editor in the AWS Console: https://console.aws.amazon.com/athena/home.
Be sure to use the same region that you uploaded your data and created your database in.
- If this is your first time using Athena, you will likely be met by a screen like the following. Luckily, Athena is telling us that "Before you run your first query, you need to set up a query result location in Amazon S3…". Since Athena writes the results of all queries to S3, even DDL queries, we will need to configure this setting before we can proceed. To do so, click on the highlighted text in the AWS Console that's shown in the following screenshot:
Figure 1.1 – The prompt for setting the query result's location upon your first visit to Athena
- After clicking on the modal's link, you will see the following prompt so that you can set your query result's location. You can use the same S3 bucket we used to upload our sample data, with
results
being used as the name of the folder that Athena will write query results to within that bucket. Be sure your location ends with a "/" to avoid errors:
Figure 1.2 – Athena's settings prompt for the query result's location
Next, let's learn how to create a table.
Creating your first table
It is now time to run our first Athena query. The following DDL query asks Athena to create a new table called nyc_taxi
in the packt_serverless_analytics
database, which is stored in the AWS Glue Data Catalog. The query also specifies the schema (columns), file format, and storage location of the table. For now, the other nuances of this create query are unimportant. You may find it easier to copy create table
from the create_nyc_taxi.sql
(http://bit.ly/3mXj3K0) file in the chapter_1
folder of this book's GitHub repository. Paste it into Athena's query editor, change LOCATION
so that it matches your bucket name, and click Run query. It should complete in a few seconds:
CREATE EXTERNAL TABLE 'packt_serverless_analytics'.'nyc_taxi'(
'vendorid' bigint,
'tpep_pickup_datetime' string,
'tpep_dropoff_datetime' string,
'passenger_count' bigint,
'trip_distance' double,
'ratecodeid' bigint,
'store_and_fwd_flag' string,
'pulocationid' bigint,
'dolocationid' bigint,
'payment_type' bigint,
'fare_amount' double,
'extra' double,
'mta_tax' double,
'tip_amount' double,
'tolls_amount' double,
'improvement_surcharge' double,
'total_amount' double,
'congestion_surcharge' double)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<YOUR_BUCKET_NAME>/tables/nyc_taxi/'
TBLPROPERTIES (
'areColumnsQuoted'='false',
'columnsOrdered'='true',
'compressionType'='gzip',
'delimiter'=',',
'skip.header.line.count'='1',
'typeOfData'='file')
Once your table creation DDL query completes, the left navigation pane of the Athena console will refresh with the definition of your new table. If you have other databases and tables, you may need to choose your database from the dropdown before your new table will appear.
Figure 1.3 – Athena's Database navigator will show the schema of your newly created table
At this point, the significance of the query we just ran may not be entirely apparent, but rest assured we will go deeper into why serverless DDL queries are a powerful thing. Oh, and did we mention that Athena does not charge for DDL queries?
Running your first analytics queries
When working with a new or unfamiliar set of data, it can be helpful to view a sample of the rows before exploring the dataset in more meaningful ways. This allows you to understand the schema of your dataset, including verifying that the schema (for example, column names) match the values and types. There are a few ways to do this, including the following limit query:
SELECT * from packt_serverless_analytics.nyc_taxi limit 100
This works fine in most cases, but we can do better. Many query engines, Athena included, will end up returning all 100 rows requested in the preceding query from the same S3 object. If your dataset contains many objects or files, you are getting an extremely narrow view of the table. For that reason, I prefer using the following query to view data from a broader portion of the dataset:
SELECT *
FROM packt_serverless_analytics.nyc_taxi TABLESAMPLE BERNOULLI (1)
limit 100
This query is like the earlier limit query but uses Athena's TABLESAMPLE
feature to obtain our 100 requested rows using BERNOULLI
sampling. When a table is sampled using the Bernoulli method, all the objects of the table may be scanned as opposed to likely stopping after the first object. This is because the probability of a row being included in the result is independent of any other row reducing the significance of the object scan order. In the following screenshot, we can see some of the rows that were returned using TABLESAMPLE
with the BERNOULLI
method:
Figure 1.4 – Results of executing TAMPLESAMPLE against our nyc_taxi table
While that query allowed us to confirm that Athena can indeed access our data and that the schema appears to match the data itself, we have not extracted any real insights from the data. For this, we will run our first real analytics query by generating a histogram of ride durations and distances. Our goal here is to learn how much time people are typically spending in taxis, but we'll also be able to gain insights into the quality of our data. The following query uses Athena's numeric_histogram
function to approximate the distribution with 10 buckets according to the difference between tpep_pickup_datetime
and tpep_dropoff_datetime
. Since the dataset stores datetimes as strings, we are using the date_parse
function to convert the values into actual timestamps that we can then use with Athena's date_diff
function to generate the ride durations as minutes. Lastly, the query uses a CROSS JOIN
with UNEST
to turn the histogram into rows and columns. Normally, the numeric_histogram
function returns a map containing the histogram, but this can be difficult to read. UNEST
helps us turn it into a more intuitive tabular format. Do not worry about remembering all these functions and SQL techniques right now. Athena frequently adds new capabilities, and you can always consult a reference.
You can copy the following code from GitHub at http://bit.ly/2Jm6o5v:
SELECT ride_minutes, number_rides
FROM (SELECT numeric_histogram(10,
date_diff('minute',
date_parse(tpep_pickup_datetime,'%Y-%m-%d %H:%i:%s'),
date_parse(tpep_dropoff_datetime, '%Y-%m-%d %H:%i:%s')
)
)
FROM packt_serverless_analytics.nyc_taxi ) AS x (ride_histogram)
CROSS JOIN
UNNEST(ride_histogram) AS t (ride_minutes, number_rides);
Once you run the query, the results will look as follows. You can experiment with the number of buckets that are generated by adjusting the parameters of the numeric_histogram
function. Generating 100 or even 1,000 buckets can uncover patterns that were hidden with fewer buckets. Even with just 10 buckets, we can already see a strong correlation between the distance and the number of rides. I was surprised to see that such a large portion of the yellow cab rides lasted less than 7 minutes. From this query, we can also see some likely data quality issues in the dataset. Unless one of the June 2020 rides happened in a time-traveling DeLorean, we likely have an erroneous record. Less obvious is the fact that several hundred rides claim to have lasted longer than 24 hours:
Figure 1.5 – Ride duration histogram results
Let's try one more histogram query, but this time, we will target the trip distance of the rides that took less than 7 minutes. The following code block contains the modified histogram query you can run to understand that bucket of rides. You can download it from GitHub at http://bit.ly/3hkggJl:
SELECT trip_distance, number_rides
FROM
(SELECT numeric_histogram(5,trip_distance)
FROM packt_serverless_analytics.nyc_taxi
WHERE date_diff('minute',
date_parse(tpep_pickup_datetime,'%Y-%m-%d %H:%i:%s'),
date_parse(tpep_dropoff_datetime, '%Y-%m-%d %H:%i:%s')
) <= 6.328061
) AS x (ride_histogram)
CROSS JOIN UNNEST(ride_histogram) AS t (trip_distance , number_rides);
Considering that the average person can walk a mile in 15 minutes, New Yorkers must be in a serious hurry to opt for taxi rides instead of a 15-minute walk!
Figure 1.6 – Ride distance histogram results
With that, we've been through the basics of AWS Athena. Let's conclude by providing a recap of what we've learned.