You're reading from Apache Hive Cookbook

Product type Paperback

Published in Apr 2016

Publisher Packt

ISBN-13 9781782161080

Length 268 pages

Edition 1st Edition

Languages

Java

Tools

Hadoop

Concepts

Data Processing

Table of Contents (14) Chapters

Preface

1. Developing Hive

2. Services in Hive

3. Understanding the Hive Data Model

4. Hive Data Definition Language

5. Hive Data Manipulation Language

6. Hive Extensibility Features

7. Joins and Join Optimization

8. Statistics in Hive

9. Functions in Hive

10. Hive Tuning

11. Hive Security

12. Hive Integration with Other Frameworks

Index

Deploying Hive Metastore

Apache Hive is a client-side library that provides a table-like abstraction on top of the data in HDFS for data processing. Hive jobs are converted into a MapReduce plan, which is then submitted to the Hadoop cluster. A Hadoop cluster is a set of nodes or machines on which HDFS, MapReduce, and YARN are deployed. MapReduce works on the distributed data stored in HDFS and processes large datasets in parallel, in contrast to traditional processing engines that run a whole task on a single machine and can take hours or days to complete a single query. Yet Another Resource Negotiator (YARN) manages the RAM and CPU cores of the whole cluster, which are critical for running any process on a node.

The Hive table and database definitions, and their mapping to the data in HDFS, are stored in a metastore. A metastore is a central repository for Hive metadata. It consists of two main components, both of which are essential for working with Hive:

Services to which the client connects in order to query the metastore
A backing database to store the metadata

Getting ready

In this book, the installation steps and other instructions assume a GNU/Linux-based installation of Apache Hive.

Before installing Hive, the first step is to make sure that a Java SE environment is installed properly. Hive requires Java SE version 6 or later, which can be downloaded from http://www.oracle.com/technetwork/java/javase/downloads/index.html.

How to do it…

In Hive, a metastore (the service and its RDBMS database) can be configured in one of the following ways:

An embedded metastore
A local metastore
A remote metastore

When we install Hive on an existing Hadoop cluster, Hive uses the embedded database by default. This means that we need not configure any database as the Hive metastore. Let's check out what these configurations are and why we call them the embedded, local, and remote metastores.

By default, the metastore service and the Hive service run in the same JVM. Hive needs a database to store metadata. In the default mode, it uses an embedded Derby database stored on the local file system. The embedded mode of Hive has the limitation that only one session can be open at a time from the same location on a machine, because only one embedded Derby instance can acquire the lock on the database files on disk.

An embedded metastore runs a single service in a single JVM and cannot work with multiple nodes at a time.

To overcome this limitation, a separate RDBMS database can be run on the same node. The metastore service and the Hive service still run in the same JVM. This configuration mode is named local metastore. Here, local means that the metastore service runs in the same JVM as the Hive service and that the database runs on the same node.
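As an illustration of the local mode, the following hive-site.xml fragment is a minimal sketch rather than a definitive setup. It assumes MySQL as the separate RDBMS running on the same node; the database name hivemetastore, the user hiveuser, and the password are hypothetical placeholders, and the driver class shown is the classic MySQL Connector/J name, which may differ for newer driver versions. The JDBC driver JAR is also assumed to be available on Hive's classpath.

```xml
<configuration>
    <!-- Assumption: MySQL on the same node; host, port, and database name are placeholders -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hivemetastore?createDatabaseIfNotExist=true</value>
    </property>
    <!-- Assumption: classic Connector/J driver class; newer drivers may use another name -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- Hypothetical credentials; replace with the real metastore account -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hiveuser</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hivepassword</value>
    </property>
    <!-- Left empty so that the metastore service keeps running inside the Hive JVM -->
    <property>
        <name>hive.metastore.uris</name>
        <value></value>
    </property>
</configuration>
```

Because the database is now a standalone server rather than an embedded Derby instance, more than one Hive session can connect to it at the same time.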

There is one more configuration, in which one or more metastore servers run in a JVM process separate from the Hive service and connect to a database on a remote machine. This configuration is named remote metastore.

The Hive service is configured to use a remote metastore by setting hive.metastore.uris to the metastore server URIs, separated by commas. The Hive metastore can be configured using the properties specified in the following sections.
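For the remote mode, a client-side hive-site.xml might look like the following sketch. The two host names are hypothetical placeholders, and the port 9083 is assumed because it is the usual default for the metastore Thrift service; adjust both to match the actual metastore servers.

```xml
<configuration>
    <!-- Assumption: two metastore servers on hypothetical hosts, listening on port 9083 -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://metastore1.example.com:9083,thrift://metastore2.example.com:9083</value>
    </property>
</configuration>
```

With this value set, the Hive service talks to the listed metastore servers instead of starting a metastore inside its own JVM, so the JDBC connection properties are needed only on the metastore servers themselves.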

[Figure: pictorial representation of the metastore and driver]

<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>The directory relative to fs.default.name where managed tables are stored.
    </description>
</property>

<property>
    <name>hive.metastore.uris</name>
    <value></value>
    <description>The URIs specifying the remote metastore servers to connect to. If there are multiple remote servers, clients connect in a round-robin fashion.
    </description>
</property>

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=hivemetastore;create=true</value>
    <description>The JDBC URL of the database.
    </description>
</property>

<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>The JDBC driver class name.
    </description>
</property>

<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>username</value>
    <description>The metastore username to connect with.
    </description>
</property>

<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>The metastore password to connect with.
    </description>
</property>
