Running ETL queries
While this book's goal is not to teach Structured Query Language (SQL), it is beneficial to spend some time reviewing everyday SQL recipes and how they relate to Athena's strengths and quirks. Transforming data from one format to another, producing intermediate datasets, or simply running a query that outputs many megabytes (MB) or gigabytes (GB) of output necessitates some understanding of Athena's best practices to achieve peak price/performance. As we did in Chapter 1, Your First Query, let's start by preparing a larger dataset for our exercises.
We will continue using the NYC Yellow Taxi
dataset, but we will prepare 2.5 years of this data this time. Preparing this expanded dataset will entail downloading, compressing, and then uploading dozens of files to S3. To expedite that process, you can use the following script to automate the steps. To do so, add all the files from yellow_tripdata_2018-01.csv
through yellow_tripdata_2020-06.csv...