You're reading from Apache Hive Essentials Immerse yourself on a fantastic journey to discover the attributes of big data by using Hive

Product type Paperback

Published in Feb 2015

Publisher Packt

ISBN-13 9781783558575

Length 208 pages

Edition 1st Edition

Languages

SQL

Tools

Hive

Concepts

Data Processing

Author (1):

Dayong Du


Table of Contents (12) Chapters

Preface

1. Overview of Big Data and Hive

2. Setting Up the Hive Environment

3. Data Definition and Description

4. Data Selection and Scope

5. Data Manipulation

6. Data Aggregation and Sampling

7. Performance Considerations

8. Extensibility Considerations

9. Security Considerations

10. Working with Other Tools

Index

Hive overview

Hive is a standard for SQL queries over petabytes of data in Hadoop. It provides SQL-like access to data in HDFS, which lets Hadoop be used like a data warehouse. The Hive Query Language (HQL) has semantics and functions similar to those of standard SQL in relational databases, so experienced database analysts can pick it up easily. HQL queries can run on different computing frameworks, such as MapReduce, Tez, and Spark, for better performance.

Hive's data model provides a high-level, table-like structure on top of HDFS. It supports three data structures: tables, partitions, and buckets. Tables correspond to HDFS directories and can be divided into partitions, which in turn can be divided into buckets. Hive supports most primitive data types, such as TIMESTAMP, STRING, FLOAT, BOOLEAN, DECIMAL, DOUBLE, INT, SMALLINT, and BIGINT, as well as complex data types, such as UNION, STRUCT, MAP, and ARRAY.
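
To make the table, partition, and bucket hierarchy concrete, here is a minimal Python sketch of how a row might map to a location under a table's HDFS directory. The table name, column names, and values are hypothetical, and the bucket function is a simplified stand-in (a non-negative hash taken modulo the bucket count); the hash Hive actually applies depends on the column type and the Hive version.

```python
from posixpath import join


def partition_dir(warehouse, table, partition_spec):
    """Build the directory for one partition: <warehouse>/<table>/<col>=<value>/..."""
    parts = [f"{col}={value}" for col, value in partition_spec.items()]
    return join(warehouse, table, *parts)


def bucket_for(key, num_buckets):
    """Pick a bucket for an integer key.

    Assumption: the key's hash is the integer itself, masked to be
    non-negative, then reduced modulo the number of buckets.
    """
    return (key & 0x7FFFFFFF) % num_buckets


# Hypothetical table "employee", partitioned by year and month,
# with rows clustered into 4 buckets by employee_id.
path = partition_dir("/user/hive/warehouse", "employee", {"year": 2015, "month": 2})
print(path)                 # /user/hive/warehouse/employee/year=2015/month=2
print(bucket_for(101, 4))   # 1
```

Each partition is a subdirectory named after its column values, and each bucket is a file inside that subdirectory, so a query that filters on the partition columns only needs to read the matching directories.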

The following diagram shows the architecture of Hive within the Hadoop ecosystem. The Hive metadata store (also called the metastore) can use an embedded, local, or remote database. Hive servers are built on Apache Thrift Server technology. Since the release of Hive 0.11, HiveServer2 has been available to handle multiple concurrent clients. It supports Kerberos, LDAP, and custom pluggable authentication, and provides better options for JDBC and ODBC clients, especially for metadata access.

Hive architecture
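
As a small illustration of how a JDBC client addresses HiveServer2, the following Python sketch assembles a connection URL of the form jdbc:hive2://host:port/database. The host name and Kerberos realm are hypothetical, the helper function is not part of any Hive library, and the port of 10000 is only the commonly used default; a real deployment may use different values.

```python
def hiveserver2_jdbc_url(host, port=10000, database="default", kerberos_principal=None):
    """Assemble a HiveServer2 JDBC URL (illustrative helper, not a Hive API)."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if kerberos_principal:
        # For Kerberos, the server principal is appended as a session parameter.
        url += f";principal={kerberos_principal}"
    return url


# Hypothetical host and realm, for illustration only.
print(hiveserver2_jdbc_url("hs2.example.com"))
print(hiveserver2_jdbc_url("hs2.example.com",
                           kerberos_principal="hive/hs2.example.com@EXAMPLE.COM"))
```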

Here are some highlights of Hive that we can keep in mind moving forward:

Hive provides a simpler query model with less coding than MapReduce
HQL and SQL have similar syntax
Hive provides lots of functions that lead to easier analytics usage
On the same huge datasets, Hive's response time is typically much faster than that of other types of queries
Hive supports running on different computing frameworks
Hive supports ad hoc querying of data on HDFS
Hive supports user-defined functions, scripts, and customized I/O formats to extend its functionality
Hive is scalable and extensible to various types of data and bigger datasets
Mature JDBC and ODBC drivers allow many applications to pull Hive data for seamless reporting
Hive allows users to read data in arbitrary formats, using SerDes and Input/Output formats
Hive has a well-defined architecture for metadata management, authentication, and query optimizations
There is a big community of practitioners and developers working on and using Hive
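
One way Hive can be extended with scripts, as mentioned in the list above, is the TRANSFORM clause, which streams rows to an external program as tab-separated lines on standard input and reads tab-separated lines back from standard output. The following is a minimal Python sketch of such a script. The table, the columns (name, salary), and the script name are hypothetical, and NULL handling is left out for brevity.

```python
import io
import sys

# Hypothetical HQL that could invoke this logic once saved as to_annual.py
# with a final call to main():
#   ADD FILE to_annual.py;
#   SELECT TRANSFORM (name, salary)
#   USING 'python to_annual.py'
#   AS (name, annual_salary)
#   FROM employee;


def transform_line(line):
    """Turn one 'name<TAB>monthly_salary' row into 'NAME<TAB>annual_salary'."""
    name, salary = line.rstrip("\n").split("\t")
    return f"{name.upper()}\t{int(salary) * 12}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read tab-separated rows from stdin and write transformed rows to stdout."""
    for line in stdin:
        if line.strip():  # skip blank lines
            stdout.write(transform_line(line) + "\n")


# Local demonstration with in-memory streams instead of a Hive job.
out = io.StringIO()
main(io.StringIO("alice\t100\nbob\t250\n"), out)
print(out.getvalue(), end="")
```

Keeping the per-row logic in its own function makes the script easy to check locally before it is shipped to the cluster.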
