Creating buckets in Hive
When queries filter on a column with a very large number of unique values, partitioning is not a good fit. Partitioning on a high-cardinality column such as ID would create a huge number of small files in HDFS and an equally large number of partition entries in the metastore, increasing the load on both the NameNode and the metastore service.
To optimize queries on such a dataset, we instead group the data into a fixed number of buckets: Hive assigns each row to a bucket by hashing the value of the bucketing column and taking it modulo the number of buckets, so the data is spread evenly across all buckets.
How to do it…
Using the same sales dataset, if we need to optimize queries on a column with many unique values, such as ID, we create buckets on that column as follows:
create table sales_buck (id int, fname string, state string, zip string, ip string, pid string)
clustered by (id) into 50 buckets
row format delimited fields terminated by '\t';
Here, we have defined 50 buckets for this table, which means that the complete dataset is divided and stored across 50 buckets based on the hash of the ID column.
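As a sketch of how the bucketed table might be populated and queried, assuming an unbucketed source table named sales with the same columns (the source table name is an assumption, not shown in this excerpt):

```sql
-- On older Hive versions, enforce bucketing on insert so data is
-- actually distributed into 50 files (Hive 2.x and later enforce
-- bucketing by default):
set hive.enforce.bucketing = true;

-- Populate the bucketed table; Hive hashes id modulo 50 to choose
-- the target bucket for each row:
insert overwrite table sales_buck
select id, fname, state, zip, ip, pid from sales;

-- Bucketing also enables efficient sampling, for example reading
-- only 1 of the 50 buckets instead of scanning the full table:
select * from sales_buck tablesample(bucket 1 out of 50 on id);
```

Because rows with the same ID always hash to the same bucket, point lookups and joins on ID can be limited to a single bucket rather than the whole dataset.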
...