Managing clusters
In the HBase ecosystem, it is essential to monitor the cluster in order to control and improve its performance and health as it grows. Because HBase sits on top of the Hadoop stack and serves real-time user traffic, we need visibility into the cluster's performance at any given point in time; this lets us detect problems early and take corrective action before they escalate.
Getting ready
It is important to know some of the details of Ganglia and its distributed components before we get into the details of managing clusters.
gmond
This is the name of a low-footprint service, the Ganglia Monitoring Daemon. It needs to be installed on each node from which we want to pull metrics. This daemon is the actual workhorse: it collects data about each host using a listen/announce protocol, gathering core metrics such as disk, active processes, network, memory, and CPU/vCPUs.
gmetad
This is the Ganglia Meta Daemon. It is a service that polls data from other gmetad and gmond instances and merges it into a single meta-cluster image. The data is stored in RRD and XML formats, which client applications can then browse.
gweb
It's a PHP-based web interface onto the data collected by the two daemons above. It requires the following:
- Apache web server
- PHP 5.2 or later
- The PHP json extension
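Before installing, it can help to check these prerequisites from the shell. The helper below only compares a "major.minor" PHP version string against the 5.2 minimum; feeding it the live version (for example, via `php -r 'echo PHP_VERSION;'`) and checking `php -m` for the json extension is left as a usage note, since those commands assume PHP is already on the PATH.

```shell
# Sketch: verify a PHP version string meets gweb's 5.2 minimum.
php_version_ok() {
  # Succeeds (exit 0) if "major.minor" is at least 5.2.
  major="${1%%.*}"
  rest="${1#*.}"
  minor="${rest%%.*}"
  [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 2 ]; }
}

php_version_ok 5.3 && echo "PHP version OK" || echo "PHP too old"
```

In practice you would run something like `php_version_ok "$(php -r 'echo PHP_VERSION;' | cut -d. -f1,2)"` and `php -m | grep -qi '^json$'` on the target host.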
How to do it…
We will divide this recipe into two sections. In the first, we will install Ganglia on all the nodes; once that is done, we will integrate it with HBase so that the relevant metrics become available.
Ganglia setup
To install Ganglia, it is best to use the prebuilt binary packages available from the vendor distributions, as these take care of the prerequisite libraries. Alternatively, it can be downloaded from the Ganglia website at http://sourceforge.net/projects/ganglia/files/latest/download?source=files.
If you are working from a command prompt rather than a browser, you can download it with the following command (typed as a single line on your shell):
wget http://downloads.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.0.7%20%28Fossett%29/ganglia-3.0.7.tar.gz
Use sudo in case you don't have privileges for the current directory, or download it to /tmp and copy it to the respective location later.
tar -xzvf ganglia-3.0.7.tar.gz -C /opt/HBaseB
rm -rf ganglia-3.0.7.tar.gz
This deletes the tar file, which is no longer needed.
- Now let's install the dependencies:
sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev
The -y option means that apt-get won't wait for the user's confirmation; it assumes yes whenever a confirmation prompt would appear.
- Build and install the downloaded and extracted sources:
cd /opt/HBaseB/ganglia-3.0.7
./configure
make
sudo make install
(./configure is the standard configuration command in a Linux environment.)
- Once the preceding step is completed, you can generate a default configuration file with:
gmond --default_config > /etc/gmond.conf
Use "sudo su -" in case there is a privilege issue; it makes you the root user and allows /etc/gmond.conf to be written.
vi /etc/gmond.conf
and change the following:
globals {
  user = ganglia
}
Note
In case you are using a specific user to perform the Ganglia tasks, change the user above accordingly.
- The recommendation is to create this user with the following command:
sudo adduser --disabled-login --no-create-home ganglia
Then fill in the cluster section of /etc/gmond.conf:
cluster {
  name = "HBaseB"                                      # the name of your cluster
  owner = "HBaseB Company"
  url = "http://yourHBaseMaster.ganglia-monitor.com/"  # the URL of the main monitor, or the CNAME
}
- The multicast setup, which is the default, is good for fewer than 120 nodes. For more than 120 nodes, we have to switch to unicast.
The setup is as follows:
Change /etc/gmond.conf as follows:
udp_send_channel {
  # mcast_join = <the multicast address to join>
  host = yourHBaseMaster.ganglia-monitor.com
  port = 8649
  # ttl = 1
}
udp_recv_channel {
  # mcast_join = <the multicast address to join>
  port = 8649
  # bind = <the IP address to bind to>
}
- Start the monitoring daemon with:
sudo gmond
We can test it with:
nc <hostname> 8649
or:
telnet <hostname> 8649
Note
You have to kill the daemon process to stop it. Find the process ID with:
ps -ef | grep gmond
Then execute:
sudo kill -9 <PID>
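The nc/telnet test above works because gmond answers any TCP connection on port 8649 with an XML dump of the cluster state. The sketch below pulls metric names out of such a dump; the XML sample is a hand-made stand-in for a live socket, trimmed to the attributes Ganglia actually emits.

```shell
# Extract metric names from a gmond XML dump. Against a live node you
# would pipe the socket in, e.g.:  nc yourHBaseMaster 8649 | extract_metric_names
extract_metric_names() {
  grep -o 'METRIC NAME="[^"]*"' | sed 's/^METRIC NAME="//; s/"$//'
}

# A trimmed, hand-made sample of what gmond returns:
printf '%s\n' \
  '<HOST NAME="node1" REPORTED="1700000000">' \
  '  <METRIC NAME="cpu_idle" VAL="97.2" UNITS="%"/>' \
  '  <METRIC NAME="mem_free" VAL="102400" UNITS="KB"/>' \
  '</HOST>' | extract_metric_names
```

A non-empty metric list confirms gmond is collecting; an empty one usually points at a udp_recv_channel misconfiguration.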
- Now we have to install the Ganglia meta daemon (gmetad). One instance is enough for a cluster of fewer than 100 nodes. This is the workhorse responsible for creating the graphs, so it requires a powerful machine with decent compute capacity.
- Let's move ahead:
cd /u/HBaseB/ganglia-3.0.7
./configure --with-gmetad
make
sudo make install
sudo cp /u/HBaseB/ganglia-3.0.7/gmetad/gmetad.conf /etc/gmetad.conf
- Open the file using:
sudo vi /etc/gmetad.conf
and change the following:
setuid_username "ganglia"
data_source "HBaseB" yourHBaseMaster.ganglia-monitor.com
gridname "<your grid name, say HBaseB Grid>"
- Now we need to create directories, which will store data in a round-robin database (rrds):
mkdir -p /var/lib/ganglia/rrds
Now let's change the ownership to the ganglia user, so that it can read and write as needed:
chown -R ganglia:ganglia /var/lib/ganglia/
- Let's start the daemon:
gmetad
Note
You have to kill the daemon process to stop it. Find the process ID with:
ps -ef | grep gmetad
Then execute:
sudo kill -9 <PID>
- Now, let's focus on Ganglia web.
sudo apt-get -y install rrdtool apache2 php5-mysql libapache2-mod-php5 php5-gd
Tip
Note that this will install rrdtool (the round-robin database tool), Apache/httpd, the php5 connector (Apache to MySQL), the php5-mysql drivers, and so on. - Copy the PHP-based web files to the following location:
cp -r /u/HBaseB/ganglia-3.0.7/web /var/www/ganglia
sudo /etc/init.d/apache2 restart
(Other arguments that can be used are status and stop.)
- Point your browser at http://yourHBaseMaster.ganglia-monitor.com/ganglia; you should start seeing the basic graphs (the HBase-specific ones will not appear until the HBase setup is done). - Integrate HBase and Ganglia:
vi /u/HBaseB/hbase-0.98.5-hadoop2/conf/hadoop-metrics2-hbase.properties
- Change the following parameters to get the different statuses into Ganglia:
hbase.extendedperiod = 3600
hbase.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
hbase.period=5
hbase.servers=master2:8649
# The jvm context provides memory used, thread count in the JVM, and so on.
jvm.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
jvm.period=5
jvm.servers=master2:8649
# Enable the rpc context to see metrics on each HBase RPC method invocation.
rpc.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
rpc.period=5
rpc.servers=master2:8649
- Copy /u/HBaseB/hbase-0.98.5-hadoop2/conf/hadoop-metrics2-hbase.properties to all the nodes, then restart the HMaster and all the region servers.
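Copying the file to every node by hand gets tedious on a large cluster. The sketch below only prints the scp/ssh commands so they can be reviewed before running; the node names slave1/slave2 are hypothetical, and the hbase-daemon.sh path assumes the install location used in this recipe.

```shell
# Print (do not run) the commands that push the metrics config to each
# node and restart its region server. Pipe the output to `sh` once reviewed.
gen_push_cmds() {
  conf="$1"; shift
  for node in "$@"; do
    echo "scp $conf $node:$conf"
    echo "ssh $node /u/HBaseB/hbase-0.98.5-hadoop2/bin/hbase-daemon.sh restart regionserver"
  done
}

gen_push_cmds /u/HBaseB/hbase-0.98.5-hadoop2/conf/hadoop-metrics2-hbase.properties \
  slave1 slave2   # hypothetical hostnames
```

Emitting the commands instead of executing them keeps the sketch safe to run anywhere; `gen_push_cmds … | sh` performs the actual rollout.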
How it works…
As the system grows from a few nodes to tens or hundreds, or becomes a very large cluster of more than a few hundred nodes, it's pivotal to have a holistic view, a drill-down view, and a historical view of the metrics at any given point of time, in graphical form. In large and very large installations, administrators are also concerned about redundancy, which avoids a single point of failure. HBase and the underlying HDFS are designed to handle node failures gracefully, but it's equally important to monitor these failures, as they can pull the whole cluster down if corrective action is not taken in time. HBase exposes various metrics to JMX and Ganglia, such as HMaster and region server statistics, JVM (Java Virtual Machine) details, RPC (Remote Procedure Call) counts, and Hadoop/HDFS and MapReduce details. Taking all these points and various other salient and powerful features into consideration, we chose Ganglia.
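The JMX metrics mentioned above are also exposed as JSON over HTTP by the HBase master web UI (the /jmx endpoint, served on port 60010 in the 0.98 line), e.g. curl -s http://yourHBaseMaster:60010/jmx, where the hostname is this recipe's placeholder. The sketch below pulls MBean names out of such a dump; the JSON sample is a hand-made stand-in for a live cluster.

```shell
# Extract MBean names from an HBase /jmx JSON dump. Against a live master:
#   curl -s http://yourHBaseMaster:60010/jmx | extract_bean_names
extract_bean_names() {
  grep -o '"name" *: *"[^"]*"' | sed 's/^"name" *: *"//; s/"$//'
}

# A trimmed, hand-made sample of the /jmx response shape:
sample='{"beans":[{"name":"Hadoop:service=HBase,name=Master,sub=Server"},{"name":"java.lang:type=Memory"}]}'
echo "$sample" | extract_bean_names
```

This is handy for spot checks when Ganglia is down, or for scripting alerts on individual beans.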
Ganglia provides the following advantages:
- It provides near-real-time monitoring for all the vital information of a very large cluster.
- It runs on commodity hardware and suits most of the popular operating systems.
- It's open source and relatively easy to install.
- It integrates easily with traditional monitoring systems.
- It provides an overall view of all nodes in a grid and all nodes in the cluster.
- The monitored data is available in both textual and graphic format.
- Works on a multicast listen/announce protocol.
- Works with open standards, including:
- JSON
- XML
- XDR
- RRDTool
- APR (Apache Portable Runtime)
- Apache HTTPD server
- PHP-based web interface
HBase works only with Ganglia versions 3.0.X and higher; hence we used version 3.0.7.
In step 4, we installed the library dependencies required for Ganglia to compile.
In step 5, we compiled and installed Ganglia by running the configure command, followed by make and then make install.
In step 6, we created the gmond.conf file, and later, in step 7, we changed the settings to point to the HBase master node. We also configured port 8649 with a user, ganglia, who can read from the cluster. By commenting out the multicast address and the TTL (time to live), we changed the default UDP-based multicasting to unicasting, which enables us to expand the cluster beyond 120 nodes. We also added a master gmond node in this config file.
In step 8, we started gmond and got core monitoring data such as CPU, disk, network, memory, and load average for the nodes.
In step 9, we went back to /u/HBaseB/ganglia-3.0.7/ and reran the configuration, but this time we added ./configure --with-gmetad so that it compiles with gmetad support.
In step 11, we copied gmetad.conf from /u/HBaseB/ganglia-3.0.7/gmetad/gmetad.conf to /etc/gmetad.conf.
In step 12, we added the ganglia user and the master details in the data_source line: data_source "HBaseB" yourHBaseMaster.ganglia-monitor.com.
In steps 13/14, we created the rrds directory that holds the data in round-robin databases; later on, we started the gmetad daemon on the master node.
In step 15, we installed all the dependencies required to run the web interface.
In step 16, we copied the PHP web files from the extracted location (/u/HBaseB/ganglia-3.0.7/web) to /var/www/ganglia.
In step 17, we restarted the Apache instance and saw all the basic graphs, which provide the details of the nodes and the hosts, but not yet the HBase details. We also copied the configuration to all the nodes so that they are identical and the Ganglia master receives data from the child nodes.
In step 18, we changed the settings in hadoop-metrics2-hbase.properties so that HBase starts collecting metrics and sending them to the Ganglia servers on port 8649. The class responsible for providing these details is org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31, together with its properties.
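For reference, the metrics2 framework also accepts a prefix-style syntax for sink configuration; the sketch below is an equivalent form, not a drop-in file. The Ganglia sink class ships with Hadoop, and the master2:8649 host:port pair repeats this recipe's value.

```properties
# hadoop-metrics2-hbase.properties, metrics2 prefix syntax (sketch)
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
hbase.sink.ganglia.servers=master2:8649
jvm.sink.ganglia.servers=master2:8649
```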
Now we point the browser at the master's URL, and once the page renders, it starts showing the graphs described by the image HBase-Ganglia-MasterAndRegion01-01.png, including the following:
- Memory and CPU usage
- JVM details (GC cycle, memory consumed by JVM, threads used, heap consumed, and so on)
- HBase Master details
- HBase Region compaction queue details
- Region server flush queue utilizations
- Region servers IO
There is more…
Ganglia is used for monitoring very large clusters, and in the world of Hadoop/HBase it can be very useful, as it provides views of the following:
- JVM
- HDFS
- MapReduce
- Region compaction time
- Region store files
- Region block cache hit ratio
- Master split size
- Master split number of operations
- Region block free
- NameNode activities
- Secondary NameNode details
- Disk status
See also
You can refer to the following sites for more information: