Column statistics in Hive
In addition to table and partition statistics, Hive supports column statistics. Hive captures the following statistics when a column or set of columns is analyzed:
The number of distinct values
The number of NULL values
Minimum or maximum K values where K could be given by a user
Histogram: frequency and height balanced
Average size of the column
Average or sum of all values in the column if their type is numerical
Percentiles of the values
How to do it…
As discussed in the previous recipe, Hive provides the analyze command to compute table or partition statistics. The same command can be used to compute statistics for one or more columns of a Hive table or partition. The HiveQL syntax for computing column statistics is as follows:
hive> ANALYZE TABLE t1 [PARTITION p1] COMPUTE STATISTICS FOR [COLUMNS c1, c2, ...];
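As a concrete instance of this syntax, here is a minimal sketch. The table sales, its columns id and amount, and the partition column dt are hypothetical names chosen only for illustration; they do not come from this recipe.

```sql
-- Hypothetical partitioned table, used only for illustration.
CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
PARTITIONED BY (dt STRING);

-- Compute column statistics for two columns of a single partition.
ANALYZE TABLE sales PARTITION (dt='2015-01-01')
COMPUTE STATISTICS FOR COLUMNS id, amount;

-- Compute column statistics for the same columns across all partitions
-- by leaving the partition value unspecified.
ANALYZE TABLE sales PARTITION (dt)
COMPUTE STATISTICS FOR COLUMNS id, amount;
```

Note that the partition specification is written in parentheses, with the partition value quoted when it is a string. Support for the second form, and for omitting the column list entirely, may depend on the Hive release in use.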
Note
The analyze command does not support table or column aliases.
The following example illustrates the use of the analyze command...