Analytics functions in Hive
Hive provides the following set of analytical functions:
RANK
DENSE_RANK
ROW_NUMBER
PERCENT_RANK
CUME_DIST
NTILE
Common and useful sets of analytical functions are ranking functions where rows from resultset are ranked according to a scheme.
How to do it…
Let's analyze each function in detail. We will be using the same sales
dataset and applying analytical functions to it:
ROW_NUMBER
: This function will provide a unique number to each row in resultset based on theORDER BY
clause within thePARTITION
. For example, if we want to assignrow_number
to eachfname
, which is also partitioned by IP address in thesales
dataset, the query would be:hive> select fname,ip,ROW_NUMBER() OVER (ORDER BY ip ) as rownum from sales;
RANK
: It is similar toROW_NUMBER
, but the equal rows are ranked with the same number. For example, if we useRANK
in the previous query instead ofROW_NUM
:hive> select fname,ip,RANK() OVER (ORDER BY ip) as ranknum, RANK() OVER (PARTITION BY ip order...