You're reading from Serverless Analytics with Amazon Athena Query structured, unstructured, or semi-structured data in seconds without setting up any infrastructure

Product type Paperback

Published in Nov 2021

Publisher Packt

ISBN-13 9781800562349

Length 438 pages

Edition 1st Edition

Languages

Python

Tools

Amazon Athena

Concepts

Data Processing

Authors (3):

Aaron Wishnick

Mert Turkay Hocanin

Anthony Virtuoso

View More author details

Table of Contents (20) Chapters

Preface

1. Section 1: Fundamentals Of Amazon Athena

2. Chapter 1: Your First Query FREE CHAPTER

3. Chapter 2: Introduction to Amazon Athena

4. Chapter 3: Key Features, Query Types, and Functions

5. Section 2: Building and Connecting to Your Data Lake

6. Chapter 4: Metastores, Data Sources, and Data Lakes

7. Chapter 5: Securing Your Data

8. Chapter 6: AWS Glue and AWS Lake Formation

9. Section 3: Using Amazon Athena

10. Chapter 7: Ad Hoc Analytics

11. Chapter 8: Querying Unstructured and Semi-Structured Data

12. Chapter 9: Serverless ETL Pipelines

13. Chapter 10: Building Applications with Amazon Athena

14. Chapter 11: Operational Excellence – Monitoring, Optimization, and Troubleshooting

Technical requirements

15. Section 4: Advanced Topics

16. Chapter 12: Athena Query Federation

17. Chapter 13: Athena UDFs and ML

18. Chapter 14: Lake Formation – Advanced Topics

19. Other Books You May Enjoy

Running your first query

Athena supports both Data Definition Language (DDL) and Data Manipulation Language (DML) queries. Queries where you SELECT data from a table are a common example of DML queries. Our first meaningful Athena query will be a DDL query that creates, or defines, our NYC Taxis data table:

Let's begin by ensuring our AWS account and IAM user/role are ready to use Athena. To do that, navigate to the Athena query editor in the AWS Console: https://console.aws.amazon.com/athena/home.
Be sure to use the same region that you uploaded your data and created your database in.
If this is your first time using Athena, you will likely be met by a screen like the following. Luckily, Athena is telling us that "Before you run your first query, you need to set up a query result location in Amazon S3…". Since Athena writes the results of all queries to S3, even DDL queries, we will need to configure this setting before we can proceed. To do so, click on the highlighted text in the AWS Console that's shown in the following screenshot:
Figure 1.1 – The prompt for setting the query result's location upon your first visit to Athena
After clicking on the modal's link, you will see the following prompt so that you can set your query result's location. You can use the same S3 bucket we used to upload our sample data, with results being used as the name of the folder that Athena will write query results to within that bucket. Be sure your location ends with a "/" to avoid errors:

Figure 1.2 – Athena's settings prompt for the query result's location

Next, let's learn how to create a table.

Creating your first table

It is now time to run our first Athena query. The following DDL query asks Athena to create a new table called nyc_taxi in the packt_serverless_analytics database, which is stored in the AWS Glue Data Catalog. The query also specifies the schema (columns), file format, and storage location of the table. For now, the other nuances of this create query are unimportant. You may find it easier to copy create table from the create_nyc_taxi.sql (http://bit.ly/3mXj3K0) file in the chapter_1 folder of this book's GitHub repository. Paste it into Athena's query editor, change LOCATION so that it matches your bucket name, and click Run query. It should complete in a few seconds:

CREATE EXTERNAL TABLE 'packt_serverless_analytics'.'nyc_taxi'(
  'vendorid' bigint, 
  'tpep_pickup_datetime' string, 
  'tpep_dropoff_datetime' string, 
  'passenger_count' bigint, 
  'trip_distance' double, 
  'ratecodeid' bigint, 
  'store_and_fwd_flag' string, 
  'pulocationid' bigint, 
  'dolocationid' bigint, 
  'payment_type' bigint, 
  'fare_amount' double, 
  'extra' double, 
  'mta_tax' double, 
  'tip_amount' double, 
  'tolls_amount' double, 
  'improvement_surcharge' double, 
  'total_amount' double, 
  'congestion_surcharge' double)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<YOUR_BUCKET_NAME>/tables/nyc_taxi/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'columnsOrdered'='true', 
  'compressionType'='gzip', 
  'delimiter'=',',
  'skip.header.line.count'='1', 
  'typeOfData'='file')

Once your table creation DDL query completes, the left navigation pane of the Athena console will refresh with the definition of your new table. If you have other databases and tables, you may need to choose your database from the dropdown before your new table will appear.

Figure 1.3 – Athena's Database navigator will show the schema of your newly created table

At this point, the significance of the query we just ran may not be entirely apparent, but rest assured we will go deeper into why serverless DDL queries are a powerful thing. Oh, and did we mention that Athena does not charge for DDL queries?

Running your first analytics queries

When working with a new or unfamiliar set of data, it can be helpful to view a sample of the rows before exploring the dataset in more meaningful ways. This allows you to understand the schema of your dataset, including verifying that the schema (for example, column names) match the values and types. There are a few ways to do this, including the following limit query:

SELECT * from packt_serverless_analytics.nyc_taxi limit 100

This works fine in most cases, but we can do better. Many query engines, Athena included, will end up returning all 100 rows requested in the preceding query from the same S3 object. If your dataset contains many objects or files, you are getting an extremely narrow view of the table. For that reason, I prefer using the following query to view data from a broader portion of the dataset:

SELECT *
FROM packt_serverless_analytics.nyc_taxi TABLESAMPLE BERNOULLI (1) 
limit 100

This query is like the earlier limit query but uses Athena's TABLESAMPLE feature to obtain our 100 requested rows using BERNOULLI sampling. When a table is sampled using the Bernoulli method, all the objects of the table may be scanned as opposed to likely stopping after the first object. This is because the probability of a row being included in the result is independent of any other row reducing the significance of the object scan order. In the following screenshot, we can see some of the rows that were returned using TABLESAMPLE with the BERNOULLI method:

Figure 1.4 – Results of executing TAMPLESAMPLE against our nyc_taxi table

While that query allowed us to confirm that Athena can indeed access our data and that the schema appears to match the data itself, we have not extracted any real insights from the data. For this, we will run our first real analytics query by generating a histogram of ride durations and distances. Our goal here is to learn how much time people are typically spending in taxis, but we'll also be able to gain insights into the quality of our data. The following query uses Athena's numeric_histogram function to approximate the distribution with 10 buckets according to the difference between tpep_pickup_datetime and tpep_dropoff_datetime. Since the dataset stores datetimes as strings, we are using the date_parse function to convert the values into actual timestamps that we can then use with Athena's date_diff function to generate the ride durations as minutes. Lastly, the query uses a CROSS JOIN with UNEST to turn the histogram into rows and columns. Normally, the numeric_histogram function returns a map containing the histogram, but this can be difficult to read. UNEST helps us turn it into a more intuitive tabular format. Do not worry about remembering all these functions and SQL techniques right now. Athena frequently adds new capabilities, and you can always consult a reference.

You can copy the following code from GitHub at http://bit.ly/2Jm6o5v:

SELECT ride_minutes, number_rides
    FROM (SELECT numeric_histogram(10,
        date_diff('minute',
         date_parse(tpep_pickup_datetime,'%Y-%m-%d %H:%i:%s'),
         date_parse(tpep_dropoff_datetime, '%Y-%m-%d %H:%i:%s')
         )
    )
FROM packt_serverless_analytics.nyc_taxi ) AS x (ride_histogram)
CROSS JOIN 
    UNNEST(ride_histogram) AS t (ride_minutes, number_rides);

Once you run the query, the results will look as follows. You can experiment with the number of buckets that are generated by adjusting the parameters of the numeric_histogram function. Generating 100 or even 1,000 buckets can uncover patterns that were hidden with fewer buckets. Even with just 10 buckets, we can already see a strong correlation between the distance and the number of rides. I was surprised to see that such a large portion of the yellow cab rides lasted less than 7 minutes. From this query, we can also see some likely data quality issues in the dataset. Unless one of the June 2020 rides happened in a time-traveling DeLorean, we likely have an erroneous record. Less obvious is the fact that several hundred rides claim to have lasted longer than 24 hours:

Figure 1.5 – Ride duration histogram results

Let's try one more histogram query, but this time, we will target the trip distance of the rides that took less than 7 minutes. The following code block contains the modified histogram query you can run to understand that bucket of rides. You can download it from GitHub at http://bit.ly/3hkggJl:

SELECT trip_distance, number_rides
FROM 
    (SELECT numeric_histogram(5,trip_distance)
       FROM packt_serverless_analytics.nyc_taxi 
       WHERE date_diff('minute',
         date_parse(tpep_pickup_datetime,'%Y-%m-%d %H:%i:%s'),
         date_parse(tpep_dropoff_datetime, '%Y-%m-%d %H:%i:%s')
         ) <= 6.328061
    ) AS x (ride_histogram)
CROSS JOIN UNNEST(ride_histogram) AS t (trip_distance , number_rides);

Considering that the average person can walk a mile in 15 minutes, New Yorkers must be in a serious hurry to opt for taxi rides instead of a 15-minute walk!

Figure 1.6 – Ride distance histogram results

With that, we've been through the basics of AWS Athena. Let's conclude by providing a recap of what we've learned.

You're reading from Serverless Analytics with Amazon Athena Query structured, unstructured, or semi-structured data in seconds without setting up any infrastructure

Table of Contents (20) Chapters

Running your first query

Creating your first table

Running your first analytics queries

Authors (3)

Personalised recommendations for you