Hive overview
Hive is a standard for SQL queries over petabytes of data in Hadoop. It provides SQL-like access to data in HDFS, enabling Hadoop to be used as a data warehouse. The Hive Query Language (HQL) has semantics and functions similar to those of standard SQL in relational databases, so experienced database analysts can pick it up easily. HQL can run on different computing frameworks, such as MapReduce, Tez, and Spark, for better performance.
Hive's data model provides a high-level, table-like structure on top of HDFS. It supports three data structures: tables, partitions, and buckets. Tables correspond to HDFS directories and can be divided into partitions, which in turn can be divided into buckets. Hive supports most primitive data types, such as TIMESTAMP, STRING, FLOAT, BOOLEAN, DECIMAL, DOUBLE, INT, SMALLINT, and BIGINT, as well as complex data types, such as UNION, STRUCT, MAP, and ARRAY.
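The mapping from tables, partitions, and buckets to HDFS locations can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hive's implementation: the warehouse root `/user/hive/warehouse` is the common default and is configurable, the function names are hypothetical, the `000000_0` style of bucket file name is the typical convention rather than a guarantee, and the bucket calculation is simplified to integer keys.

```python
from pathlib import PurePosixPath

# Assumed warehouse root; a real cluster sets this in its Hive configuration.
WAREHOUSE_ROOT = PurePosixPath("/user/hive/warehouse")


def table_dir(database, table):
    """Return the directory assumed to hold a managed table's data."""
    # Tables in the default database sit directly under the warehouse root;
    # other databases get a '<name>.db' subdirectory.
    if database == "default":
        return WAREHOUSE_ROOT / table
    return WAREHOUSE_ROOT / f"{database}.db" / table


def partition_dir(database, table, partition_spec):
    """Add one 'column=value' directory level per partition column."""
    path = table_dir(database, table)
    for column, value in partition_spec:
        path = path / f"{column}={value}"
    return path


def bucket_index(key, num_buckets):
    """Simplified bucket assignment for an integer key.

    Assumes the key hashes to itself and is masked to a non-negative
    32-bit value before taking the remainder.
    """
    return (key & 0x7FFFFFFF) % num_buckets


def bucket_file(database, table, partition_spec, key, num_buckets):
    """Return the bucket file a row with this key would be assumed to land in."""
    index = bucket_index(key, num_buckets)
    return partition_dir(database, table, partition_spec) / f"{index:06d}_0"


if __name__ == "__main__":
    # Hypothetical table 'orders' in database 'sales', partitioned by year
    # and month, and bucketed into 4 buckets on an integer customer id.
    print(bucket_file("sales", "orders", [("year", 2018), ("month", 1)], 10, 4))
```

Reading the result from left to right shows the hierarchy described above: the table is a directory, each partition column adds a nested directory, and each bucket is a file inside the innermost partition directory.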
The following diagram shows the architecture of Hive within the Hadoop ecosystem. The Hive metadata store (also called the metastore) can use an embedded, local, or remote database. The Hive servers are built on Apache Thrift Server technology. Since Hive 0.11, HiveServer2 has been available to handle multiple concurrent clients. It supports Kerberos, LDAP, and custom pluggable authentication, and provides better options for JDBC and ODBC clients, especially for metadata access.
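As a rough illustration of how a JDBC client addresses HiveServer2, the following Python sketch assembles a connection URL. The helper name, host, and principal are hypothetical placeholders; port 10000 is assumed as the usual HiveServer2 default, and the `principal` parameter is shown for the Kerberos case only. With LDAP, the username and password are normally supplied as connection credentials rather than in the URL.

```python
def hiveserver2_jdbc_url(host, port=10000, database="default",
                         kerberos_principal=None):
    """Build a HiveServer2 JDBC URL (illustrative helper, not part of Hive)."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if kerberos_principal:
        # For Kerberos, the server's service principal is appended
        # as a session parameter after a semicolon.
        url += f";principal={kerberos_principal}"
    return url


if __name__ == "__main__":
    # Placeholder host and realm, chosen only for the example.
    print(hiveserver2_jdbc_url("hs2.example.com"))
    print(hiveserver2_jdbc_url("hs2.example.com",
                               kerberos_principal="hive/_HOST@EXAMPLE.COM"))
```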
Here are some highlights of Hive that we can keep in mind moving forward:
- Hive provides a simpler query model with less coding than MapReduce
- HQL and SQL have similar syntax
- Hive provides many functions that make analytics easier
- The response time is typically much faster than that of other types of queries on comparably large datasets
- Hive supports running on different computing frameworks
- Hive supports ad hoc querying of data on HDFS
- Hive supports user-defined functions, scripts, and a customized I/O format to extend its functionality
- Hive is scalable and extensible to various types of data and bigger datasets
- Mature JDBC and ODBC drivers allow many applications to pull Hive data for seamless reporting
- Hive allows users to read data in arbitrary formats, using SerDes and Input/Output formats
- Hive has a well-defined architecture for metadata management, authentication, and query optimizations
- There is a big community of practitioners and developers working on and using Hive