Apache Hive Cookbook

Chapter 1. Developing Hive

In this chapter, we will cover the following recipes:

Deploying Hive on a Hadoop cluster
Deploying Hive Metastore
Installing Hive
Configuring HCatalog
Understanding different components of Hive
Compiling Hive from source
Hive packages
Debugging Hive
Running Hive
Changing configurations at runtime

Deploying Hive Metastore

Apache Hive is a client-side library that provides a table-like abstraction on top of the data in HDFS for data processing. Hive jobs are converted into a MapReduce plan, which is then submitted to the Hadoop cluster. A Hadoop cluster is a set of nodes or machines on which HDFS, MapReduce, and YARN are deployed. MapReduce works on the distributed data stored in HDFS and processes large datasets in parallel, in contrast to traditional processing engines that process a whole task on a single machine and can take hours or days for a single query. Yet Another Resource Negotiator (YARN) manages the RAM and CPU cores of the whole cluster, which are critical for running any process on a node.

The Hive table and database definitions, and their mapping to the data in HDFS, are stored in a metastore. A metastore is a central repository for Hive metadata. It consists of two main components, both of which are essential for working with Hive:

A service to which clients connect to query the metastore
A backing database to store the metadata

Getting ready

In this book, the installation steps and other instructions assume a GNU/Linux-based installation of Apache Hive.

Before installing Hive, the first step is to make sure that a Java SE environment is installed properly. Hive requires Java version 6 or later, which can be downloaded from http://www.oracle.com/technetwork/java/javase/downloads/index.html.

How to do it…

In Hive, a metastore (service and RDBMS database) can be configured in one of the following ways:

An embedded metastore
A local metastore
A remote metastore

When we install Hive on a preinstalled Hadoop cluster, Hive uses the embedded database by default. This means that we need not configure any database as a Hive metastore. Let's check out what these configurations are and why we call them the embedded, local, and remote metastore.

By default, the metastore service and the Hive service run in the same JVM. Hive needs a database to store metadata. In the default mode, it uses an embedded Derby database stored on the local file system. The embedded mode of Hive has the limitation that only one session can be opened at a time from the same location on a machine, because only one embedded Derby instance can acquire the lock on the database files on disk.

An Embedded Metastore has a single service and a single JVM that cannot work with multiple nodes at a time.

To overcome this limitation, a separate RDBMS database is run on the same node. The metastore service and the Hive service still run in the same JVM. This configuration mode is named local metastore. Here, local means that the metastore service shares the JVM of the Hive service and that the database runs on the same node.
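As a minimal sketch of the local mode described above, assuming MySQL as the separate RDBMS, the connection properties in hive-site.xml might look as follows. The database name, username, and password are hypothetical placeholders, and the driver class shown assumes the older MySQL Connector/J 5.x JAR is on Hive's classpath.

```xml
<!-- hive-site.xml: sketch of a local metastore; MySQL is an assumed choice of RDBMS -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <!-- Hypothetical database name; the database runs on the same node -->
    <value>jdbc:mysql://localhost:3306/hivemetastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <!-- Driver class for MySQL Connector/J 5.x (assumption) -->
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <!-- Hypothetical credentials; replace with your own -->
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Left empty so that the metastore service stays in the Hive JVM -->
    <value></value>
  </property>
</configuration>
```

Because the database is now a standalone server rather than an embedded Derby instance, more than one Hive session can hold a connection to it at the same time.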

There is one more configuration, in which one or more metastore servers run in a JVM process separate from the Hive service and connect to a database on a remote machine. This configuration is named remote metastore.

The Hive service is configured to use a remote metastore by setting hive.metastore.uris to the metastore server URIs, separated by commas. The Hive metastore can be configured using the properties listed below.

[Figure: pictorial representation of the metastore and driver]

<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>The directory relative to fs.default.name where managed tables are stored.
    </description>
</property>

<property>
    <name>hive.metastore.uris</name>
    <value></value>
    <description>The URIs specifying the remote metastore servers to connect to. If there are multiple remote servers, clients connect in a round-robin fashion.
    </description>
</property>

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=hivemetastore;create=true</value>
    <description>The JDBC URL of the database.
    </description>
</property>

<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>The JDBC driver class name.
    </description>
</property>

<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>username</value>
    <description>The metastore username to connect with.
    </description>
</property>

<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>The metastore password to connect with.
    </description>
</property>
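
For the remote metastore mode, a client-side hive-site.xml mainly needs to point at the metastore servers. The following is a sketch under stated assumptions: the two hostnames are hypothetical, and 9083 is the port on which the metastore service listens by default.

```xml
<!-- hive-site.xml on the Hive client: sketch of a remote metastore setup -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Comma-separated Thrift URIs; both hostnames are hypothetical -->
    <value>thrift://metastore1.example.com:9083,thrift://metastore2.example.com:9083</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <!-- Location of managed tables, same as in the listing above -->
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
```

In this mode, the javax.jdo.option.* connection properties belong in the configuration of the metastore servers themselves, since only they talk to the backing database.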

Configuring HCatalog

Assuming that Hive has been configured with a remote metastore, let's look into how to install and configure HCatalog.

Getting ready

The HCatalog CLI supports these command-line options:

Option	Usage	Description
`-g`	`hcat -g mygrp`	Tells HCatalog that the table to be created must have the group "`mygrp`".
`-p`	`hcat -p rwxrwxr-x`	Tells HCatalog that the table to be created must have the permissions "`rwxrwxr-x`".
`-f`	`hcat -f myscript.hcat`	Tells HCatalog that `myscript.hcat` is a file containing DDL commands to execute.
`-e`	`hcat -e 'create table mytable(a int);'`	Tells HCatalog to treat the following string as a DDL command and execute it.
`-D`	`hcat -Dkey=value`	Passes the key-value pair to HCatalog as a Java system property.
	`hcat`	Prints a usage message.

How to do it...

From Hive 0.11.0 onward, HCatalog is packaged with the Hive binaries. Because we have already configured Hive, we can access the HCatalog command-line tool, hcat, from the shell. The script is available in the hcatalog/bin directory.

Understanding different components of Hive

Besides the Hive metastore, Hive components can be broadly classified as Hive clients and Hive servers. Hive servers provide interfaces that make the metastore available to external applications and handle user authentication and authorization. Hive clients are the various applications used to access and execute Hive queries on the Hadoop cluster.

HiveServer

Let's take a look at its various components.

Hive metastore

The metastore service starts on the port specified in the Hive metastore URIs. The metastore provides APIs to query the databases, tables, schemas, and other entities stored in the RDBMS datastore.

How to do it...

The metastore service starts as a Java process in the backend. You can start the Hive metastore service with the following command:

hive --service metastore &

HiveServer2

HiveServer2 is an interface that allows clients to execute Hive queries and get the results. It is based on Thrift RPC and supports multiple concurrent clients, as against the single client supported by HiveServer. It also provides for the authentication and authorization of users.

How to do it...

The HiveServer2 service also starts as a Java process in the backend. You can start HiveServer2 with the following command:

hive --service hiveserver2 &
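
The port and authentication behavior of HiveServer2 are controlled through hive-site.xml. As a rough sketch, the fragment below spells out three commonly used properties with what are, to the best of our knowledge, their default values; treat the values as assumptions and adjust them for your cluster.

```xml
<!-- hive-site.xml: sketch of basic HiveServer2 settings (values are the assumed defaults) -->
<configuration>
  <property>
    <name>hive.server2.thrift.port</name>
    <!-- TCP port on which HiveServer2 accepts Thrift clients such as Beeline -->
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <!-- NONE disables authentication; other modes include LDAP and KERBEROS -->
    <value>NONE</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <!-- When true, queries run as the connected user, not as the server user -->
    <value>true</value>
  </property>
</configuration>
```

With these settings, a JDBC client would connect to the server on port 10000 without having its credentials verified, so a stricter authentication mode is advisable outside of test setups.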

Hive clients

The following are the different clients available in Hive to query metastore data or to submit Hive queries to Hive servers.

Hive CLI

The following are the various sections included in Hive CLI.

Getting ready

Hive Command-line Interface (CLI) can be used to run Hive queries in either interactive or batch mode.

How to do it...

To run Hive CLI, use the following command:

$ $HIVE_HOME/bin/hive

Queries are submitted under the username of the user logged in to the UNIX system.

Beeline

The following are the various sections included in Beeline.

Getting ready

If you have configured HiveServer2, then a Beeline client can be used to interact with Hive.

How to do it...

To run Beeline, use the following command:

$ $HIVE_HOME/bin/beeline

Using Beeline, a connection can be made to any HiveServer2 instance with any username and password.

Compiling Hive from source

In this recipe, we will see how to compile Hive from source.

Getting ready

Apache Hive is an open source framework available for compilation and modification by any user. The Hive source code is a Maven project. The build runs several scripts during compilation, and these scripts require a UNIX platform.

The following prerequisites need to be installed:

UNIX OS: UNIX is preferable for Hive source compilation. Although the source can also be compiled on Windows, you would need to comment out the script executions in the build.
Maven: The following are the steps to configure Maven:
1. Download the Apache Maven binaries for Linux (.tar.gz) from https://maven.apache.org/download.cgi.
```
wget http://mirror.olnevhost.net/pub/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
```
2. Extract the tar file:
```
tar -xzvf apache-maven-3.3.3-bin.tar.gz
```
3. Create a folder and move the Maven binaries to that folder (the tarball extracts to a directory named apache-maven-3.3.3):
```
sudo mkdir -p /usr/lib/maven
sudo mv apache-maven-3.3.3 /usr/lib/maven/
```
4. Open /etc/profile:
```
sudo nano /etc/profile
```
5. Add the following variables to set up the environment PATH:
```
export M2_HOME=/usr/lib/maven/apache-maven-3.3.3
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
```
6. Use the command source /etc/profile to add the variables to PATH without a restart:
```
source /etc/profile
```
7. Check whether Maven is properly installed:
```
mvn -version
```

How to do it...

Follow these steps to compile Hive on a Unix OS:

Download the latest version of the Hive source tar file:

wget http://a.mbbsindia.com/hive/hive-1.2.1/apache-hive-1.2.1-src.tar.gz

Extract the source folder:

tar -xzvf apache-hive-1.2.1-src.tar.gz

Move to the Hive directory:
```
cd apache-hive-1.2.1-src
```
To import the Hive packages into Eclipse, run the following command:
```
mvn eclipse:eclipse
```
To compile Hive with Hadoop 2 binaries, run the following command:
```
mvn clean install -Phadoop-2,dist
```
If you want to skip test execution, run the earlier command with the -DskipTests switch:
```
mvn clean install -DskipTests -Phadoop-2,dist
```
To generate a tarball file from the source code, run the following command:
```
mvn clean package -DskipTests -Phadoop-2 -Pdist
```

Hive packages

The following are the various sections included in Hive packages.

Getting ready

Hive source consists of different modules categorized by the features they provide or as a submodule of some other module.

How to do it...

The following is the list of Hive modules and their usage in Hive:

accumulo-handler: Apache Accumulo is a distributed key-value datastore based on Google's Bigtable. This package includes the components responsible for mapping a Hive table to an Accumulo table. AccumuloStorageHandler and AccumuloPredicateHandler are the main classes responsible for mapping tables. For more information, refer to the official integration documentation available at https://cwiki.apache.org/confluence/display/Hive/AccumuloIntegration.
ant: This tool was used to build earlier versions of the Hive source. Ant is also needed to configure the Hive Web Interface server.
beeline: A Hive client used to connect with HiveServer2 and run Hive queries.
bin: This package includes scripts to start Hive clients and services.
cli: This is a Hive Command-line Interface implementation.
common: These are utility classes used by other modules.
conf: This contains default configurations and user-defined configuration objects.
contrib: This contains SerDes, generic UDFs, and file formats contributed to Hive by third parties.
hbase-handler: This module allows Hive SQL statements to access HBase tables for SELECT and INSERT commands. It also provides interfaces to access HBase and Hive tables for join and union in a single query. More information is available at https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration.
hcatalog: This is a table management framework that helps other frameworks such as Pig or MapReduce to access the Hive metastore and table schema.
hwi: This module provides an implementation of a web interface to run Hive queries. Also, the WebHCat APIs provide REST APIs to access the Hive metastore.
jdbc: This is a connector that accepts JDBC connections and calls to execute Hive queries on the cluster.
metastore: This is the API that provides access to metastore entities, including databases, tables, schemas, and SerDes.
odbc: This module implements the Open Database Connectivity (ODBC) API, enabling ODBC applications to connect and execute queries over Hive.
ql: This module provides an interface to clients that checks for query semantics and provides an implementation for driver, parser, and query planner.
serde: This module has the implementation of the serializers and deserializers used by Hive to read and write data. It helps in validating and parsing record and field types.
shims: This is the module that transparently intercepts and modifies calls to the Hive API, usually for compatibility purposes.
spark-client: This module provides an interface to execute Hive SQL queries on the Spark framework.