Top K statistics in Hive
Top K statistics is a mechanism for collecting the K most frequent values of a column in a Hive table. The top K values of the most skewed column are stored in the partition. This applies to both existing and newly created tables.
How to do it…
Top K statistics computation is disabled by default. The following properties can be set to compute and store top K statistics:
hive.stats.topk.collect
Enables computing the top K values and storing them in the skewed information.
Default value: false
Valid values: true, false
hive.stats.topk.num
Specifies the value of K for the top K result.
hive.stats.topk.minpercent
The minimum percentage of rows that a value must account for to be included in the top K result. It can be any float value between 0.0 and 100.
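To make the roles of K and the minimum percentage concrete, here is a minimal Python sketch of the idea. It is only an illustration of the concept, not Hive's implementation: the function name top_k_values, its parameters, and the sample data are hypothetical, and it assumes that a value exactly at the threshold is kept.

```python
from collections import Counter


def top_k_values(values, k, min_percent=0.0):
    """Return up to k (value, count, percent) tuples, most frequent first.

    Illustrative only: `k` plays the role of hive.stats.topk.num and
    `min_percent` the role of hive.stats.topk.minpercent. Treating the
    threshold as inclusive is an assumption of this sketch.
    """
    total = len(values)
    if total == 0:
        return []

    result = []
    # most_common() yields (value, count) pairs sorted by descending count.
    for value, count in Counter(values).most_common():
        percent = 100.0 * count / total
        if percent < min_percent:
            break  # every remaining value is at most this frequent
        result.append((value, count, percent))
        if len(result) == k:
            break
    return result


# Hypothetical column with a skewed distribution: 60% US, 30% IN, 10% UK.
sample = ["US"] * 6 + ["IN"] * 3 + ["UK"] * 1

print(top_k_values(sample, k=2))                    # limited by K
print(top_k_values(sample, k=4, min_percent=20.0))  # limited by the percentage
```

In both calls only US and IN are returned: the first is cut off by K, while the second allows four values but drops UK because it accounts for less than 20 percent of the rows.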
Let's set the following properties for top K statistics:
hive> set hive.stats.topk.collect=true;
hive> set hive.stats.topk.num=4;
hive> set hive.stats.topk.minpercent=0;
hive> set hive.stats.topk.poolsize=100;
First, let...