Deploying Hive Metastore
Apache Hive is a client-side library that provides a table-like abstraction on top of the data in HDFS for data processing. Hive jobs are converted into a MapReduce plan, which is then submitted to the Hadoop cluster. A Hadoop cluster is a set of nodes (machines) on which HDFS, MapReduce, and YARN are deployed. MapReduce works on the distributed data stored in HDFS and processes large datasets in parallel, in contrast to traditional processing engines that run a whole task on a single machine and can take hours or days to answer a single query. Yet Another Resource Negotiator (YARN) manages the RAM and CPU cores of the whole cluster, which are critical for running any process on a node.
The Hive table and database definitions, and their mapping to the data in HDFS, are stored in a metastore. A metastore is a central repository for Hive metadata. It consists of two main components, both of which are essential for working with Hive:
- Services to which clients connect in order to query the metastore
- A backing database to store the metadata
Getting ready
In this book, we assume a GNU/Linux-based environment for the installation of Apache Hive and for all other instructions.
Before installing Hive, make sure that a Java SE environment is installed properly. Hive requires Java version 6 or later, which can be downloaded from http://www.oracle.com/technetwork/java/javase/downloads/index.html.
How to do it…
In Hive, a metastore (service and RDBMS database) can be configured in one of the following ways:
- An embedded metastore
- A local metastore
- A remote metastore
When we install Hive on a preinstalled Hadoop cluster, Hive uses an embedded database by default. This means that we need not configure any database as the Hive metastore. Let's check out what these configurations are and why we call them the embedded, local, and remote metastore.
By default, the metastore service and the Hive service run in the same JVM. Hive needs a database to store metadata. In default mode, it uses an embedded Derby database stored on the local file system. The embedded mode has the limitation that only one session can be opened at a time from the same location on a machine, because only one embedded Derby instance can acquire the lock on the database files on disk and access them.
An Embedded Metastore has a single service and a single JVM that cannot work with multiple nodes at a time.
To overcome this limitation, a separate RDBMS database runs on the same node. The metastore service and the Hive service still run in the same JVM. This configuration mode is named local metastore. Here, local means that the metastore service runs in the same JVM as the Hive service and that the database runs on the same node.
There is one more configuration, in which one or more metastore servers run in JVM processes separate from the Hive service and connect to a database on a remote machine. This configuration is named remote metastore.
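As a minimal sketch of the local metastore mode, the hive-site.xml fragment below points Hive at a standalone RDBMS while leaving hive.metastore.uris empty, so the metastore service stays inside the Hive JVM. MySQL is only an assumed choice of database here; the host, port, database name, user name, and password are placeholders, and the driver class name depends on the MySQL Connector/J version in use.

```xml
<!-- Local metastore sketch: metastore service in the Hive JVM, database in a separate process. -->
<!-- Assumption: a MySQL server on the same node; names and credentials are placeholders. -->
<property>
  <name>hive.metastore.uris</name>
  <!-- Left empty so that Hive does not look for a remote metastore server -->
  <value></value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hivemetastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <!-- Connector/J 5.x class name; newer Connector/J releases use com.mysql.cj.jdbc.Driver -->
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>username</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>
```

The JDBC driver JAR for the chosen database must also be on Hive's classpath for this configuration to work.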
The Hive service is configured to use a remote metastore by setting hive.metastore.uris to the metastore server URIs, separated by commas. The Hive metastore can be configured using the properties specified in the following sections.
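For the remote metastore mode, a client-side hive-site.xml might look like the following sketch. The two host names are hypothetical, and 9083 is assumed as the port because it is the metastore server's usual default; adjust both to match the actual deployment.

```xml
<!-- Remote metastore sketch: the Hive service connects to metastore servers over Thrift. -->
<!-- Assumption: metastore1.example.com and metastore2.example.com are hypothetical hosts. -->
<property>
  <name>hive.metastore.uris</name>
  <!-- Comma-separated list of metastore server URIs -->
  <value>thrift://metastore1.example.com:9083,thrift://metastore2.example.com:9083</value>
</property>
```

In this mode, the JDBC connection properties (javax.jdo.option.*) are needed on the metastore servers rather than on every Hive client, since only the metastore servers talk to the backing database.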
The following diagram gives a pictorial representation of the metastore and the driver:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>The directory relative to fs.default.name where managed tables are stored.</description>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value></value>
  <description>The URIs specifying the remote metastore servers to connect to. If there are multiple remote servers, clients connect in a round-robin fashion.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=hivemetastore;create=true</value>
  <description>The JDBC URL of the database.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>The JDBC driver class name.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>username</value>
  <description>The metastore user name to connect with.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
  <description>The metastore password to connect with.</description>
</property>