Hive is a standard for SQL queries over petabytes of data in Hadoop. It provides SQL-like access to data in HDFS, enabling Hadoop to be used as a data warehouse. The Hive Query Language (HQL) has semantics and functions similar to those of standard SQL in relational databases, so experienced database analysts can pick it up easily. HQL queries can run on different computing engines, such as MapReduce, Tez, and Spark.
Hive's metadata structure provides a high-level, table-like structure on top of HDFS. It supports three main data structures: tables, partitions, and buckets. Tables correspond to HDFS directories and can be divided into partitions, whose data files can in turn be divided into buckets. Hive's metadata usually serves as the schema in the schema-on-read approach on Hadoop, which means you do not have to define the schema in Hive before you store data in HDFS. Applying Hive metadata after the data is stored brings more flexibility and efficiency to your data work. The popularity of Hive's metadata has made it the de facto way to describe big data, and it is used by many tools in the big data ecosystem.
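To make the mapping from tables, partitions, and buckets to HDFS concrete, here is a minimal Python sketch that builds the path where a row's data file would typically live. Several details are assumptions for illustration only: the warehouse root `/user/hive/warehouse`, the `column=value` partition directory naming, the `000002_0` style of bucket file name, the rule that an integer bucketing column is hashed to its own value, and the hypothetical `employee` table. Actual locations and file names depend on your Hive version and configuration.

```python
import posixpath

# Assumed warehouse root; the real value comes from the Hive configuration.
WAREHOUSE_ROOT = "/user/hive/warehouse"


def bucket_for(value: int, num_buckets: int) -> int:
    """Pick a bucket for an integer bucketing column.

    Assumption: the integer hashes to itself, the sign bit is masked off,
    and the result is taken modulo the number of buckets.
    """
    return (value & 0x7FFFFFFF) % num_buckets


def data_file_path(table, partitions, bucket_key, num_buckets):
    """Build the HDFS path of the bucket file that would hold a row.

    table       -- table name, which becomes a directory under the warehouse
    partitions  -- ordered (column, value) pairs, one subdirectory per pair
    bucket_key  -- integer value of the bucketing column for this row
    num_buckets -- number of buckets declared for the table
    """
    partition_dirs = [f"{col}={val}" for col, val in partitions]
    bucket = bucket_for(bucket_key, num_buckets)
    # Assumed file naming: zero-padded bucket number followed by "_0".
    file_name = f"{bucket:06d}_0"
    return posixpath.join(WAREHOUSE_ROOT, table, *partition_dirs, file_name)


# Hypothetical table partitioned by year and month, with 4 buckets.
path = data_file_path("employee", [("year", "2018"), ("month", "12")], 42, 4)
print(path)
```

The point of the sketch is the nesting: one directory per table, one subdirectory per partition column, and one file per bucket inside the innermost partition directory.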
The following diagram shows the architecture of Hive in the Hadoop ecosystem. The Hive metadata store (also called the metastore) can use an embedded, local, or remote database. The Thrift server is built on Apache Thrift. HiveServer2, the latest version of this server, can handle multiple concurrent clients; supports Kerberos, LDAP, and custom pluggable authentication; and provides better options for JDBC and ODBC clients, especially for metadata access.
Here are some highlights of Hive that we can keep in mind moving forward:
- Hive provides a simple and optimized query model with less coding than MapReduce
- HQL and SQL have a similar syntax
- Hive's query response time is typically much faster than that of other tools on big datasets of the same volume
- Hive supports running on different computing frameworks
- Hive supports ad hoc querying of data on HDFS and HBase
- Hive supports user-defined Java/Scala functions, scripts, and procedural languages to extend its functionality
- Mature JDBC and ODBC drivers allow many applications to pull Hive data for seamless reporting
- Hive allows users to read data in arbitrary formats, using SerDes and Input/Output formats
- Hive is a stable and reliable batch-processing tool that has been production-ready for a long time
- Hive has a well-defined architecture for metadata management, authentication, and query optimizations
- There is a big community of practitioners and developers working on and using Hive
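As an illustration of extending Hive with scripts, the sketch below shows the kind of row-processing logic an external script might contain. It assumes that Hive streams rows to the script as tab-separated lines and represents NULL as `\N`. The column names (`name`, `salary`), the salary threshold, and the function name `transform_rows` are all hypothetical.

```python
def transform_rows(lines, threshold=5000):
    """Turn 'name<TAB>salary' lines into 'NAME<TAB>band' lines.

    lines     -- iterable of tab-separated input lines (for example, sys.stdin)
    threshold -- hypothetical cutoff between the "low" and "high" bands
    """
    for line in lines:
        name, salary = line.rstrip("\n").split("\t")
        if salary == "\\N":
            # Assumed NULL marker: pass it through unchanged.
            band = "\\N"
        elif int(salary) >= threshold:
            band = "high"
        else:
            band = "low"
        yield f"{name.upper()}\t{band}"


# Sample rows standing in for what Hive would write to the script's stdin.
sample = ["michael\t6000\n", "will\t4000\n", "lucy\t\\N\n"]
for row in transform_rows(sample):
    print(row)
```

In a real script, you would pass `sys.stdin` to `transform_rows` and print each result to standard output. A query could then invoke the script with a clause along the lines of `SELECT TRANSFORM(name, salary) USING 'python band.py' AS (name, band) FROM employee`, where `band.py` and `employee` are placeholder names.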