In this article by Ruchir Choudhry, the author of the book HBase High Performance Cookbook, we will cover the configuration and deployment of HBase.
HBase is an open source, nonrelational, column-oriented distributed database modeled after Google's Bigtable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project, and it runs on top of the Hadoop Distributed File System (HDFS), providing Bigtable-like capabilities for Hadoop. It's a column-oriented database empowered by a fault-tolerant distributed file structure known as HDFS. In addition to this, it provides advanced features, such as automatic sharding, load balancing, in-memory caching, replication, compression, near real-time lookups, strong consistency (using multiversioning), block caches and Bloom filters for real-time queries, and an array of client APIs.
Throughout this article, we will discuss how to effectively set up mid- and large-sized HBase clusters on top of the Hadoop and HDFS framework. This article will help you set up HBase on a fully distributed cluster.
For the cluster setup, we will consider redhat-6.2 Linux 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 GNU/Linux; the cluster will have six nodes.
Before we start HBase in fully distributed mode, we will first set up Hadoop-2.4.0 in distributed mode and then set up HBase on top of the Hadoop cluster, because HBase stores its data in the Hadoop Distributed File System (HDFS).
Check the permissions of the users; the HBase user must be able to create directories.
Let's create two directories in which the data for NameNode and DataNode will reside:
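A minimal sketch (run from the Hadoop home directory /u/HbaseB/hadoop-2.4.0 shown in the listing that follows):
-bash-4.1$ cd /u/HbaseB/hadoop-2.4.0
-bash-4.1$ mkdir NameNodeData DataNodeData
The resulting directories will look like this: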
drwxrwxr-x 2 app app 4096 Jun 19 22:22 NameNodeData
drwxrwxr-x 2 app app 4096 Jun 19 22:22 DataNodeData
-bash-4.1$ pwd
/u/HbaseB/hadoop-2.4.0
-bash-4.1$ ls -lh
total 60K
drwxr-xr-x 2 app app 4.0K Mar 31 08:49 bin
drwxrwxr-x 2 app app 4.0K Jun 19 22:22 DataNodeData
drwxr-xr-x 3 app app 4.0K Mar 31 08:49 etc
Following are the steps to install and configure HBase:
We will require the following components for NameNode:
Components | Details | Type of systems
An operating system | redhat-6.2 Linux 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 GNU/Linux, or another standard Linux kernel. |
Hardware/CPUs | 16 to 24 CPU cores. | NameNode/Secondary NameNode.
Hardware/RAM | 64 to 128 GB; in special cases, 128 GB to 512 GB RAM. | NameNode/Secondary NameNode.
Hardware/storage | Both NameNode servers should have highly reliable storage for their namespace storage and edit-log journaling. Typically, hardware RAID and/or reliable network storage are justifiable options. It is advisable to include an onsite disk replacement option in your support contract so that a failed RAID disk can be replaced quickly. | NameNode/Secondary NameNode.
RAID: RAID stands for Redundant Array of Inexpensive (or Independent) Disks. There are many levels of RAID, but for the Master or NameNode, RAID-1 will be enough.
JBOD: This stands for Just a Bunch of Disks. The design has multiple hard drives stacked together with no redundancy; the calling software needs to take care of failure handling and redundancy. In essence, the drives are presented as a single logical volume.
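For illustration, JBOD disks on a DataNode are typically mounted as separate directories. A sketch of the /etc/fstab entries could look like the following (the device names are illustrative; the mount points match the dfs.data.dir paths used later in this article):
/dev/sdb1  /u/dnn01  ext4  defaults,noatime  0 0
/dev/sdc1  /u/dnn02  ext4  defaults,noatime  0 0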
The following diagram shows the working mechanism of RAID and JBOD:
Before we start the cluster setup, a quick recap of the Hadoop setup, with brief descriptions, is essential.
Let's create a directory where all the software components will be downloaded:
-bash-4.1$ ls -ltr
total 32
drwxr-xr-x 9 app app 4096 Oct 7 2013 hadoop-2.4.0
drwxr-xr-x 10 app app 4096 Feb 20 10:58 zookeeper-3.4.6
drwxr-xr-x 15 app app 4096 Apr 5 08:44 pig-0.12.1
drwxrwxr-x 7 app app 4096 Jun 30 00:57 hbase-0.98.3-hadoop2
drwxrwxr-x 8 app app 4096 Jun 30 00:59 apache-hive-0.13.1-bin
drwxrwxr-x 7 app app 4096 Jun 30 01:04 mahout-distribution-0.9
Make sure that you have adequate privileges in the folder to add, edit, and execute a command. Also, you must set up password-less communication between different machines, such as from the name node to DataNode and from HBase Master to all the region server nodes. Refer to this webpage to learn how to do this: http://www.debian-administration.org/article/152/Password-less_logins_with_OpenSSH.
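As a minimal sketch (assuming OpenSSH; the app user and the datanode1 host are illustrative), you would generate a key pair once and copy the public key to every node:
-bash-4.1$ ssh-keygen -t rsa -P ""       # create a key pair with an empty passphrase
-bash-4.1$ ssh-copy-id app@datanode1     # append the public key to the remote authorized_keys
-bash-4.1$ ssh app@datanode1             # should now log in without prompting for a password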
Add the following to core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://mynamenode-hadoop:9001</value>
<description>The name of the default file system.</description>
</property>
</configuration>
You can specify any port you want to use, as long as it does not clash with the ports that are already in use by the system for various purposes. A quick look at the following link can provide more specific details; a complete treatment of this topic is outside the scope of this book: http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers.
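For example, to check whether a candidate port (here, the 9001 chosen above) is already taken on a node, you can run:
-bash-4.1$ netstat -an | grep 9001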
Next, configure the checkpoint directories for the Secondary NameNode:
<configuration>
<property>
<name>fs.checkpoint.dir</name>
<value>/u/dn001/hadoop/hdfs/secdn,/u/dn002/hadoop/hdfs/secdn</value>
<description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR, for example, /u/dn001/hadoop/hdfs/secdn,/u/dn002/hadoop/hdfs/secdn.</description>
</property>
</configuration>
The directories are kept separate to give a clean separation of the HDFS blocks and to keep the configuration as simple as possible; this also allows us to do proper maintenance.
For the NameNode:
<property>
<name>dfs.name.dir</name>
<value>/u/nn01/hadoop/hdfs/nn,/u/nn02/hadoop/hdfs/nn</value>
<description>A comma-separated list of paths where the NameNode metadata will be stored.</description>
</property>
For the DataNode:
<property>
<name>dfs.data.dir</name>
<value>/u/dnn01/hadoop/hdfs/dn,/u/dnn02/hadoop/hdfs/dn</value>
<description>A comma-separated list of paths where the DataNode blocks will be stored.</description>
</property>
<property>
<name>dfs.http.address</name>
<value>namenode.full.hostname:50070</value>
<description>Enter your NameNode hostname for HTTP access.</description>
</property>
<property>
<name>dfs.secondary.http.address</name>
<value>secondary.namenode.full.hostname:50090</value>
<description>Enter your Secondary NameNode hostname.</description>
</property>
We can go for an HTTPS setup for the NameNode as well, but let's keep this optional for now. Next, configure YARN in yarn-site.xml:
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>resourcemanager.full.hostname:8025</value>
<description>Enter your YARN ResourceManager hostname.</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>resourcemanager.full.hostname:8030</value>
<description>Enter your ResourceManager hostname</description>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>resourcemanager.full.hostname:8050</value>
<description>Enter your ResourceManager hostname.</description>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>resourcemanager.full.hostname:8041</value>
<description>Enter your ResourceManager hostname.</description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/u/dnn01/hadoop/hdfs/yarn,/u/dnn02/hadoop/hdfs/yarn</value>
<description>A comma-separated list of paths for the NodeManager's local data.</description>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/u/var/log/hadoop/yarn</value>
<description>Use the list of directories from $YARN_LOG_DIR.</description>
</property>
This completes the configuration changes required for YARN. Next, add the JobHistoryServer address to mapred-site.xml:
<property>
<name>mapreduce.jobhistory.address</name>
<value>jobhistoryserver.full.hostname:10020</value>
<description>Enter your JobHistoryServer hostname.</description>
</property>
Now, configure HBase in hbase-site.xml:
<property>
<name>hbase.rootdir</name>
<value>hdfs://hbase.namenode.full.hostname:8020/apps/hbase/data</value>
<description>Enter your NameNode hostname; this is the HDFS path under which HBase stores its data.</description>
</property>
<property>
<!-- this is the bind address -->
<name>hbase.master.info.bindAddress</name>
<value>hbase.master.full.hostname</value>
<description>Enter the HBase Master server hostname.</description>
</property>
This completes the HBase changes.
Next, configure the ZooKeeper ensemble. In zoo.cfg, list each of your ZooKeeper servers with its quorum and leader-election ports:
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
If you want to test this setup locally, use different port combinations.
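For instance, a local three-server test ensemble could use a different quorum/election port pair per instance (the ports shown are illustrative):
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890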
ZooKeeper's atomic broadcast (ZAB) is an atomic messaging protocol that keeps all the servers in sync and provides reliable delivery, total order, causal order, and so on.
List your region servers, one per line, in HBase's conf/regionservers file:
RegionServer1
RegionServer2
RegionServer3
RegionServer4
Now, format the NameNode and start the HDFS daemons:
sudo su - $HDFS_USER
/u/HbaseB/hadoop-2.4.0/bin/hadoop namenode -format
/u/HbaseB/hadoop-2.4.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode
sudo su - $HDFS_USER
/u/HbaseB/hadoop-2.4.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start secondarynamenode
sudo su - $HDFS_USER
/u/HbaseB/hadoop-2.4.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode
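A quick way to verify that the daemons are running is the jps utility that ships with the JDK; on the respective nodes, you should see entries such as the following (the process IDs are illustrative):
-bash-4.1$ jps
12345 NameNode
12346 SecondaryNameNode
12347 DataNode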
Test 01> See if you can reach the NameNode web UI from your browser:
http://namenode.full.hostname:50070
Test 02> sudo su - $HDFS_USER
/u/HbaseB/hadoop-2.4.0/bin/hadoop dfs -copyFromLocal /tmp/hello.txt /
/u/HbaseB/hadoop-2.4.0/bin/hadoop dfs -ls /
You should see hello.txt once the commands execute.
Test 03> Browse
http://datanode.full.hostname:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/&nnaddr=namenode.full.hostname:8020
You should see the details of the DataNode.
<login as $YARN_USER and source the directories.sh companion script>
/u/HbaseB/hadoop-2.4.0/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
<login as $YARN_USER and source the directories.sh companion script>
/u/HbaseB/hadoop-2.4.0/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager
hadoop fs -mkdir /app-logs
hadoop fs -chown $YARN_USER /app-logs
hadoop fs -chmod 1777 /app-logs
Execute MapReduce:
sudo su - $HDFS_USER
/u/HbaseB/hadoop-2.4.0/bin/hadoop fs -mkdir -p /mapred/history/done_intermediate
/u/HbaseB/hadoop-2.4.0/bin/hadoop fs -chmod -R 1777 /mapred/history/done_intermediate
/u/HbaseB/hadoop-2.4.0/bin/hadoop fs -mkdir -p /mapred/history/done
/u/HbaseB/hadoop-2.4.0/bin/hadoop fs -chmod -R 1777 /mapred/history/done
/u/HbaseB/hadoop-2.4.0/bin/hadoop fs -chown -R mapred /mapred
export HADOOP_LIBEXEC_DIR=/u/HbaseB/hadoop-2.4.0/libexec/
export HADOOP_MAPRED_HOME=/u/HbaseB/hadoop-2.4.0/hadoop-mapreduce
export HADOOP_MAPRED_LOG_DIR=/u/HbaseB/hadoop-2.4.0/mapred
<login as $MAPRED_USER and source the directories.sh companion script>
/u/HbaseB/hadoop-2.4.0/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
Test 01: From the browser, or using curl, open the following link:
http://resourcemanager.full.hostname:8088/
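For example, with curl (piping through head simply truncates the HTML output):
-bash-4.1$ curl -s http://resourcemanager.full.hostname:8088/ | head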
Test 02:
sudo su - $HDFS_USER
/u/HbaseB/hadoop-2.4.0/bin/hadoop jar /u/HbaseB/hadoop-2.4.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.4.0.jar teragen 100 /test/10gsort/input
/u/HbaseB/hadoop-2.4.0/bin/hadoop jar /u/HbaseB/hadoop-2.4.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.4.0.jar terasort /test/10gsort/input /test/10gsort/output
Now, create the HBase directories in HDFS and hand them over to the HBase user:
/u/HbaseB/hadoop-2.4.0/bin/hadoop fs -mkdir /apps/hbase
/u/HbaseB/hadoop-2.4.0/bin/hadoop fs -chown -R $HBASE_USER /apps/hbase
/u/HbaseB/hbase-0.98.3-hadoop2/bin/hbase-daemon.sh --config $HBASE_CONF_DIR start master
This will start the Master node.
/u/HbaseB/hbase-0.98.3-hadoop2/bin/hbase-daemons.sh --config $HBASE_CONF_DIR start regionserver
This will start the region servers listed in the regionservers file.
For a single machine, sudo ./hbase master start can also be used directly. Please check the logs in case of any errors.
Now, let's log in to the HBase shell:
sudo su - $HBASE_USER
./hbase shell
This will connect us to the HBase Master.
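As a quick smoke test from the shell (the table and column family names are illustrative):
hbase(main):001:0> status
hbase(main):002:0> create 'test', 'cf'
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
hbase(main):004:0> scan 'test'
hbase(main):005:0> list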
Start the ZooKeeper server on each node of the ensemble:
-bash-4.1$ sudo ./zkServer.sh start
JMX enabled by default
Using config: /u/HbaseB/zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
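You can confirm that the server is up with the status command, or with ZooKeeper's four-letter ruok probe (zoo1 is one of the ensemble hosts named earlier):
-bash-4.1$ ./zkServer.sh status
-bash-4.1$ echo ruok | nc zoo1 2181     # a healthy server replies imok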
You can also redirect the ZooKeeper output to a log file:
./zkServer.sh start > /u/HbaseB/zookeeper-3.4.6/zoo.out 2>&1
In this article, we learned how to configure and set up HBase, storing its data in the Hadoop Distributed File System. We also explored the working mechanisms of RAID and JBOD and the differences between the two storage configurations.