Data bucketing
Another scheme for organizing data is to divide each partition into buckets. With bucketing, the values of one or more columns determine which bucket a row is placed in, so rows with the same values are grouped together. The best columns to bucket on are those that queries frequently filter on: when a query filters on a bucketed column, less data has to be scanned and read.
High cardinality also makes a column a good candidate for bucketing. In other words, prefer columns that have a large number of unique values. For this reason, primary key columns are ideal bucketing columns.
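To make the idea concrete, the following Python sketch shows one way rows could be assigned to buckets and how a filter on the bucketing column then touches only one bucket. It is an illustration under stated assumptions, not Athena's implementation: the CRC32 hash, the bucket count, and the helper names (bucket_for, write_buckets, lookup) are all hypothetical and chosen only for this example.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen for illustration


def bucket_for(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a column value to a bucket index.

    Assumption: CRC32 stands in for whatever hash function the query
    engine really uses; only the "hash, then modulo" idea matters here.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def write_buckets(rows, column, num_buckets=NUM_BUCKETS):
    """Group rows into buckets by hashing the bucketing column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[column], num_buckets)].append(row)
    return buckets


def lookup(buckets, column, value, num_buckets=NUM_BUCKETS):
    """Filter on the bucketing column by scanning a single bucket."""
    candidates = buckets.get(bucket_for(value, num_buckets), [])
    return [row for row in candidates if row[column] == value]


# High-cardinality column: every id is unique, so rows can spread
# across the available buckets.
rows = [{"id": f"emp-{i}", "dept": f"dept-{i % 2}"} for i in range(1000)]
by_id = write_buckets(rows, "id")
match = lookup(by_id, "id", "emp-42")

# Low-cardinality column: only two distinct values exist, so at most
# two buckets can ever receive rows, however many buckets are requested.
by_dept = write_buckets(rows, "dept", num_buckets=16)
```

In this sketch, a lookup by id scans only the rows in one bucket rather than all 1,000 rows, while bucketing on the two-valued dept column leaves most of the 16 buckets empty, which is why high-cardinality columns are preferred.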
Amazon Athena makes it simple to specify which columns are bucketed: you name them in the CLUSTERED BY clause when you create the table. An example of a table creation statement using this clause follows:
CREATE EXTERNAL TABLE employee (
id string,
name string,
salary double,
address string,
timestamp bigint)
PARTITIONED BY (
timestamp string...