The SELECT statement
The most common use of Hive is to query data in Hadoop. To do this, we write and execute the SELECT statement in Hive. A SELECT statement typically projects the rows that meet the conditions specified in the WHERE clause, which follows the target table, and returns the result set. The SELECT statement is quite often used with the FROM, DISTINCT, WHERE, and LIMIT keywords. We will introduce them through examples as follows.
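As a minimal sketch of these keywords, the following queries assume a hypothetical employee table with a name column. The table, the column, and the literal value are placeholders for illustration and are not objects defined in this section.

```sql
-- Hypothetical table: employee(name STRING, ...); assumed for illustration only.

-- Select every column and every row, duplicates included.
SELECT * FROM employee;

-- Return only the unique values of the name column.
SELECT DISTINCT name FROM employee;

-- Keep only the rows that satisfy the WHERE condition.
SELECT name FROM employee WHERE name = 'Michael';

-- Cap the result at two rows; without ORDER BY, which two rows is not guaranteed.
SELECT name FROM employee LIMIT 2;
```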
The SELECT * statement means all the columns in the table are selected. By default, all rows are returned, including duplicates. If the DISTINCT keyword is used, only unique rows from the table are selected and returned. The LIMIT keyword caps the number of rows returned; without an ORDER BY clause, which rows come back is arbitrary rather than guaranteed. In addition, SELECT * scans the whole table/file without triggering MapReduce jobs, so it runs faster than SELECT <column_name>. Since Hive 0.10.0, the simple SELECT statements, such as SELECT <column_name>...