Getting started with HBase
We will discuss this section in a question-and-answer manner, coming to understand HBase with the help of scenarios and conditions.
When it started
The following figure shows the flow of HBase: birth, growth, and current status:
It all started in 2006, when a company called Powerset (later acquired by Microsoft) was looking to build a natural language search engine. The person responsible for the development was Jim Kellerman, and there were many other contributors. It was modeled on Google's BigTable white paper, which came out in 2006 and described a system running on the Google File System (GFS).
It started as a TAR file containing a bunch of Java files with the initial HBase code. It was first added to the contrib directory of Hadoop as a small subcomponent and, with dedicated effort to fill the gaps, it slowly and steadily grew into a full-fledged project. Its first release was HBase 0.1.0 and, as it became more and more feature rich and stable, it was promoted to a Hadoop subproject; then, with more and more development and contributions from the HBase user and developer groups, it became one of the top-level projects at Apache.
The following figure shows HBase versions from the beginning till now:
Let's have a look at the year-by-year evolution of HBase's important features:
- 2006: The idea of HBase started with the white paper of Google BigTable
- 2007: Per-column-family data compression was made available, online addition and deletion of column families was added, a script to start/stop an HBase cluster was added, a MapReduce connector was added, HBase shell support was added, support for row and column filters was added, an algorithm to distribute regions evenly was added, the first REST interface was added, and the first HBase meetup was hosted
- 2008: HBase 0.1.0 to 0.18.1; HBase moved to a new SVN, HBase was added as a contribution to Hadoop, HBase became a Hadoop subproject, the first separate release became available, and a Ruby shell was added
- 2009: HBase 0.19.0 to 0.20.*; improvements in writing and scanning, batching of writes and scans, block compression, an HTable interface to REST, the addition of a binary comparator, the addition of regular expression filters, and many more
- 2010 till date: HBase 0.89.* to 0.94.*; support for HDFS durability, improvements in the import flow, support for MurmurHash3, the addition of daemon threads for NameNode and DataNode to indicate VM- or kernel-caused pauses in the application log, tag support for key-values, running MapReduce over snapshot files, the introduction of transparent encryption of HBase on-disk data, the addition of per-key-value security, offline rebuilding of .META. from file system data, snapshot support, and many more

Note
For more information, visit https://issues.apache.org/jira/browse/HBASE for more detailed explanations and fuller lists of improvements and newly added features.
- 0.96 to 1.0 and the future: HBase version 1 and higher; a utility for adorning the HTTP context, full high availability with Hadoop HA, rolling upgrades, improved failure detection and recovery, cell-level access security, inline cell tagging, quotas and grouping, and reverse scans; HBase will become even more useful for analytics purposes and helpful to data scientists
Note
While we wait for the new features in v1.0, we can always visit http://hbase.apache.org for the latest releases and features.
HBase releases can be downloaded by following the download links at http://hbase.apache.org.
Let's now discuss HBase and Hadoop compatibility and the features they provide together.
Prior to Hadoop v1.0, when a DataNode crashed, the HBase Write-Ahead Log (the logfiles that record write operations before they are committed to the MemStore) would be lost, and hence the data too. Hadoop v1.0 integrated the append branch into the main core, which increased durability for HBase. Hadoop v1.0 also implemented handling of disk failure, making RegionServer more robust.
Hadoop v2.0 integrated high availability for NameNode, which also enables HBase to be more reliable and robust by allowing multiple HMaster instances. With this version of HBase, upgrading has also become easier, because it is made independent of HDFS upgrades. Let's see in the following table how recent versions of Hadoop have enhanced HBase in terms of performance, availability, and features:
| Criteria | Hadoop v1.0 | Hadoop v2.0 | Hadoop v2.x |
| --- | --- | --- | --- |
| Features | Durability using append | | |
| Performance | Short-circuit read | | |
Note
Miscellaneous features in newer HBase releases include HBase isolation and allocation, online automated repair of table integrity and region consistency problems, dynamic configuration changes, reverse scanning (stop row to start row), and many others; users can visit https://issues.apache.org/jira/browse/HBASE for the features and advancements of each HBase release.
HBase components and functionalities
Here, let's discuss the various components of HBase, and in turn their subcomponents:
- ZooKeeper
- HMaster
- RegionServer
- Client
- Catalog tables
ZooKeeper
ZooKeeper is a high-performance, centralized, multi-coordination service system for distributed applications, which provides distributed synchronization and group services to HBase.

It enables users and developers to focus on the application logic and not on coordination with the cluster, for which it provides APIs that developers can use to implement coordination tasks such as electing a master server and managing application-to-cluster communication.
In HBase, ZooKeeper is used to elect a cluster master in order to keep track of available and online servers, and to keep the metadata of the cluster. ZooKeeper APIs provide:
- Consistency, ordering, and durability
- Synchronization
- Concurrency for a distributed clustered system
The following figure shows ZooKeeper:
It was developed at Yahoo! Research. The name ZooKeeper came about because projects in the Hadoop ecosystem are named after animals, and during the naming discussion this name emerged, as the technology manages the availability of, and coordination between, the different components of a distributed system.
ZooKeeper not only simplifies development but also sits on top of the distributed system as an abstraction layer, facilitating better reachability to the components of the system. The following figure shows the request and response flow:
Let's consider a scenario wherein a few people want to store items in 10 rooms. In one instance, each person has to find their own way to a room in which to keep their items; some rooms will be locked, forcing them to move on to other rooms. In the other instance, we appoint representatives who hold information about the rooms: their condition and their state (open, closed, fit for storing, not fit, and so on). People bringing items go to a representative for this information, and the representative guides each person to the right room that is available for storage, so the person can move directly to the specified room and store the item. This not only eases communication and the storage process but also reduces overhead. The same technique is applied by ZooKeeper.
ZooKeeper internally maintains a tree of data nodes called znodes. A znode can be of two types (a small creation sketch follows this list):

- Ephemeral: This exists only as long as the session that created it, which makes it good for applications that need to know whether a specific distributed resource is available or not.
- Persistent: This is stored until a client deletes it explicitly, and it can also store some application data.
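The following is a minimal sketch of creating both kinds of znode with the plain Apache ZooKeeper Java client; the connection string, znode paths, and payload are illustrative assumptions, not values that HBase itself uses:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (host:port is a placeholder);
        // a production client would wait for the connection event first.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) {
                // React to connection/node events here in a real application.
            }
        });

        // Persistent znode: survives this session until deleted explicitly,
        // and can carry application data in its payload.
        zk.create("/demo-config", "someAppData".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: vanishes when this session ends, so other clients
        // can watch it to learn whether this resource is still alive.
        zk.create("/demo-alive", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        System.out.println(new String(zk.getData("/demo-config", false, null)));
        zk.close(); // the ephemeral /demo-alive disappears with the session
    }
}
```

HBase itself relies on this ephemeral-node behavior: each live RegionServer and the active master hold ephemeral znodes, so their disappearance signals a crash.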
Why an odd number of ZooKeepers?
ZooKeeper is based on a majority principle; it requires that a quorum of servers be up, where a quorum is a strict majority of the ensemble, that is, floor(n/2) + 1. For a three-node ensemble, this means two nodes must be up and running at any point in time, and for a five-node ensemble, a minimum of three nodes must be up. The majority is also what elects the leader among the ZooKeeper servers. An odd number is preferred because adding a node to make the ensemble even does not let it tolerate any additional failures, as the small calculation after this paragraph shows. We will discuss more options for configuring and coding ZooKeeper in later chapters.
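The following tiny, self-contained calculation (an illustration only, not part of any ZooKeeper API) shows why even-sized ensembles buy nothing extra:

```java
public class QuorumMath {
    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            int quorum = n / 2 + 1;      // strict majority: floor(n/2) + 1
            int tolerated = n - quorum;  // failures the ensemble survives
            System.out.printf("ensemble=%d quorum=%d tolerates=%d failure(s)%n",
                    n, quorum, tolerated);
        }
        // Output shows 3 and 4 nodes both tolerate 1 failure,
        // and 5 and 6 nodes both tolerate 2 failures.
    }
}
```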
HMaster
HMaster is the component of an HBase cluster that can be thought of as the NameNode of a Hadoop cluster: it acts as the master for the RegionServers running on different machines. It is responsible for monitoring all RegionServers in an HBase cluster, provides an interface to all HBase metadata for client operations, and handles RegionServer failover and region splits.

There may be more than one instance of HMaster in an HBase cluster, which provides High Availability (HA). If we have more than one master, only one is active at a time; at startup, all the masters compete to become the active master of the cluster, and whichever wins becomes the active master. Meanwhile, all other master instances remain passive until the active master crashes, shuts down, or loses its lease from ZooKeeper.
In short, it is a coordination component in an HBase cluster, which also manages and enables us to perform an administrative task on the cluster.
Let's now discuss the flow of starting up the HMaster process:
- Block (do not serve requests) until it becomes the active HMaster.
- Finish initialization.
- Enter the serving loop until stopped.
- Perform cleanup when stopped.
HMaster exports the following interfaces, which are metadata-oriented methods that enable us to interact with HBase:
| Related to | Facilities |
| --- | --- |
| HBase tables | Creating tables, deleting tables, enabling/disabling tables, and modifying tables |
| HBase column families | Adding columns, modifying columns, and removing columns |
| HBase table regions | Moving regions, assigning regions, and unassigning regions |
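These facilities are exposed to Java clients through the admin API. Here is a minimal sketch against the 0.96-era client API; the table name demo_table and the column family names cf1/cf2 are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class AdminDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Table-level facility: create a table with one column family.
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("demo_table"));
        desc.addFamily(new HColumnDescriptor("cf1"));
        admin.createTable(desc);

        // Column-family-level facility: add another family
        // (the table must be disabled while its schema changes).
        admin.disableTable("demo_table");
        admin.addColumn("demo_table", new HColumnDescriptor("cf2"));
        admin.enableTable("demo_table");

        admin.close();
    }
}
```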
In HBase, there is a table called .META. (its name on the file system), which keeps all the information about regions and is referred to by HMaster for information about the data. By default, HMaster runs on port number 60000 and its HTTP web UI is available on port 60010; both can always be changed according to our needs.
HMaster functionalities can be summarized as follows:
- Monitors RegionServers
- Handles RegionServer failover
- Handles metadata changes
- Assignment/unassignment of regions
- Interfaces all metadata changes
- Performs load rebalancing of regions in idle time
- Publishes its location to clients using ZooKeeper
- Provides, through the HMaster web UI, all information about the HBase cluster (tables, regions, RegionServers, and so on)
If a master node goes down
If the master goes down, the cluster may continue working normally, as clients talk directly to RegionServers, so the cluster may still function steadily. The HBase catalog tables (.META. and -ROOT-) exist as HBase tables and are not stored in the master's resident memory. However, as the master performs critical functions such as RegionServer failover and region splits, these functions will be hampered and, if not taken care of, this will create a huge setback to the overall functioning of the cluster, so the master must be restarted as soon as possible.
Now that Hadoop is HA enabled, HBase can likewise be made HA by running multiple HMasters for better availability and robustness, so we can now consider having multiple HMasters.
RegionServer
RegionServers are the servers that hold the actual raw HBase data. Recall that in a Hadoop cluster, the NameNode manages the metadata and the DataNodes hold the raw data; likewise, in an HBase cluster, the HMaster holds the metadata and the RegionServers hold the data. As you might guess, a RegionServer is run, or hosted, on top of a DataNode, so that it utilizes the underlying file system, that is, HDFS.
The following figure shows the architecture of RegionServer:
RegionServer performs the following tasks:
- Serving the regions (of tables) assigned to it
- Handling client read/write requests
- Flushing the cache to HDFS
- Maintaining HLogs
- Performing compactions
- Handling region splits
Components of a RegionServer
The following are the components of a RegionServer:

- Write-Ahead Log (WAL): This is also called the edit log. When data is written to or modified in HBase, it is not immediately written to disk; rather, it is kept in memory for some time (based on a size/time threshold that we can configure). Keeping this data only in memory could result in data loss if the machine went down abruptly. To solve this, the data is first recorded in an intermediate file, called the Write-Ahead logfile, and then kept in memory; in the case of a system failure, the data can be reconstructed from this logfile (a small client-side sketch of this trade-off follows this list).
- HFile: These are the actual files in which the raw data is physically stored on disk. This is the actual store file format.
- Store: This is where HFiles are kept. A store corresponds to one column family of a table in HBase.
- MemStore: This component is the in-memory data store; it resides in main memory and records the current data operations. After data is recorded in the WAL, the RegionServer stores the key-value in the MemStore.
- Region: Regions are the splits of an HBase table; a table is divided into regions based on row keys, and regions are hosted by RegionServers. There may be many regions in a RegionServer.
We will discuss more about these components in the next chapter.
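As a sketch of the WAL-then-MemStore write path described above, a client can decide per mutation whether it goes through the WAL; skipping it trades durability for speed. This uses the 0.96-era client API, and the table and family names are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "demo_table");

        // Default write path: the edit is appended to the WAL first,
        // then applied to the MemStore, and later flushed into HFiles.
        Put durable = new Put(Bytes.toBytes("row1"));
        durable.add(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("v1"));
        table.put(durable);

        // SKIP_WAL bypasses the logfile: faster, but the edit is lost
        // if the RegionServer dies before the MemStore is flushed.
        Put risky = new Put(Bytes.toBytes("row2"));
        risky.setDurability(Durability.SKIP_WAL);
        risky.add(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("v2"));
        table.put(risky);

        table.close();
    }
}
```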
Client
The client is responsible for finding the RegionServer that hosts the particular row (data); it does so by querying the catalog tables. Once the region is found, the client contacts the RegionServer directly and performs the data operation; the location information, once fetched, is cached by the client for faster retrieval afterwards. A client can be written in Java, or in any other language using the external APIs. A minimal lookup sketch follows.
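Here is such a sketch with the 0.96-era Java client; the table, row key, and column names are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "demo_table");

        // The client resolves (and caches) the region hosting this row
        // by consulting the catalog tables behind the scenes.
        HRegionLocation loc = table.getRegionLocation(Bytes.toBytes("row1"));
        System.out.println("row1 is served by " + loc.getHostname()
                + ":" + loc.getPort());

        // The read itself then goes straight to that RegionServer.
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("col1"))));
        table.close();
    }
}
```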
Catalog tables
There are two tables that maintain the information about all RegionServers and regions. This is a kind of metadata for the HBase cluster. The following are the two catalog tables that exist in HBase:
- -ROOT-: This includes information about the location of the .META. table
- .META.: This table holds information about all regions and their locations
At the beginning of the startup process, the location of .META. is obtained from -ROOT-, from where the actual metadata of the tables is read, and reads/writes continue from there. So, whenever a client wants to connect to HBase and read from or write to a table, these two tables are consulted, and the information returned allows the client to read and write directly to the RegionServers and the regions of the specific table.
Who is using HBase and why?
The following is a list of just a few of the companies that use HBase in production. There are many companies using HBase, so we will list only a few, not all.
- Adobe: They have an HBase cluster of 30 nodes and are ready to expand it. They use HBase in several areas from social services to structured data and processing for internal use.
- Facebook: They use it for messaging infrastructure.
- Twitter: They use it for a number of applications including people search, which relies on HBase internally for data generation, and also their operations team uses HBase as a time series database for cluster-wide monitoring/performance data.
- Infolinks: They use it to process advertisement selection and user events for their in-text advertising network.
- StumbleUpon: They use it, with a MapReduce data source, to overcome traditional query speed limits in MySQL.
- Trend Micro: They use it as cloud-based storage.
- Yahoo!: They use HBase to store document fingerprints for detecting near-duplicates. They have a cluster of a few nodes that runs HDFS, MapReduce, and HBase. The table contains millions of rows, and it is queried for duplicated documents with real-time traffic.
- Ancestry.com: This company uses it for DNA analysis.
- UIDAI: This is an Indian government project; they use HBase for storing resident details.
- Apache: They use it for maintaining wiki.
- Mozilla: They are moving the Socorro project to HBase.
- eBay: They use HBase for indexing site inventory.
Note
We could keep listing, but we will stop here; for further information, please visit http://wiki.apache.org/hadoop/Hbase/PoweredBy.
When should we think of using HBase?
Using HBase is not the solution to all problems; however, it can solve a lot of problems efficiently. The first thing to think about is the amount of data: if we have only a few million rows and few reads and writes, we can avoid using it. However, with billions of rows, millions of columns, and thousands of read/write operations in short intervals, we can surely think of using HBase.

Let's consider an example: Facebook uses HBase for its real-time messaging infrastructure, and we can imagine how many messages, or rows of data, Facebook receives per second. With that amount of data and I/O, we can certainly think of using HBase. The following list details a few scenarios in which we can consider using HBase:
- If data needs to have a dynamic or variable schema
- If many columns contain null values (blank columns), that is, the data is sparse
- When we have a huge number of dynamic rows
- If our data contains a variable number of columns
- If we need to maintain versions of data
- If high scalability is needed
- If we need in-built compression on records
- If a high volume of I/O is needed
There are many other cases where HBase can be used beneficially, which are discussed in later chapters.
When not to use HBase
Let's now discuss some cases where we should not feel compelled to use HBase just because everyone else is using it:
- When data is not in large amounts (in TBs and more)
- When JOINs and relational DB features are needed
- Don't go with the belief "everyone is using it"
- If RDBMS fits your requirements, use RDBMS
Understanding some open source HBase tools
The following is the list of some HBase tools that are available in the development world:
- hbaseexplorer: This tool provides a UI for HBase; using this tool, we can perform the following operations:
- Data visualization
- Table creation, deletion, and cloning
- Table statistics
- Scans
Note
For reference, go to http://sourceforge.net/projects/hbaseexplorer/.
- Toad for Cloud Databases: This is the tool to connect to HBase and perform various functions.
Note
For reference, go to http://www.toadworld.com/products/toad-for-cloud-databases/default.aspx.
- HareDB HBase Client: This is an HBase client with a user-friendly interface, making it an easy-to-use GUI tool for HBase (including Pig and high-speed Hive query support)
Note
For reference, go to http://sourceforge.net/projects/haredbhbaseclie/.
- hrider: hrider is a UI application that provides an easier way to view or manipulate the data saved in HBase.
Note
For reference, go to https://github.com/NiceSystems/hrider.
- Hannibal: This is a tool for Apache HBase region monitoring.
Note
For reference, go to https://github.com/sentric/hannibal.
- Performance Monitoring & Alerting (SPM): SPM is a proactive performance monitoring, anomaly detection, and alerting solution, available in the cloud (SaaS) as well as on premises. SPM can monitor Solr, Elasticsearch, Hadoop, HBase, ZooKeeper, Kafka, Storm, Redis, the JVM, system metrics, custom metrics, and more.
Note
For reference, go to http://sematext.com/spm/.
- Phoenix: This tool is a SQL skin over HBase, delivered as a client-embedded JDBC driver, powering the HBase use cases at Salesforce.com. Phoenix targets low-latency queries (milliseconds), as opposed to batch operations via MapReduce; a small connection sketch follows the reference links.
Note
For reference, go to https://github.com/forcedotcom/phoenix and http://phoenix.apache.org/.
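Because Phoenix is delivered as a JDBC driver, using it looks like ordinary JDBC. This is a minimal sketch; the ZooKeeper host in the URL and the table and columns are illustrative assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixDemo {
    public static void main(String[] args) throws Exception {
        // The Phoenix JDBC URL names the ZooKeeper quorum of the HBase cluster.
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        Statement stmt = conn.createStatement();

        stmt.execute("CREATE TABLE IF NOT EXISTS demo (id INTEGER PRIMARY KEY, name VARCHAR)");
        stmt.execute("UPSERT INTO demo VALUES (1, 'hbase')"); // Phoenix's insert/update
        conn.commit();

        ResultSet rs = stmt.executeQuery("SELECT id, name FROM demo");
        while (rs.next()) {
            System.out.println(rs.getInt("id") + " -> " + rs.getString("name"));
        }
        conn.close();
    }
}
```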
- Impala: Cloudera Impala is a parallel-processing SQL query engine that runs on Apache Hadoop. The Apache-licensed, open source Impala project combines modern, scalable, parallel database technology with the power of Hadoop; users can directly query data stored in HDFS and Apache HBase without requiring data movement or transformation.
The Hadoop-HBase version compatibility table
As there are compatibility issues in almost all systems, HBase likewise has compatibility issues with Hadoop versions, which means that not every version of HBase can be used on top of every Hadoop version. The following is the Hadoop-HBase version compatibility matrix, which should be kept in mind while configuring HBase on Hadoop (credit: Apache):
| Hadoop versions | HBase 0.92.x | HBase 0.94.x | HBase 0.96.0 | HBase 0.98.0 |
| --- | --- | --- | --- | --- |
| Hadoop 0.20.205 | Supported | Not supported | Not supported | Not supported |
| Hadoop 0.22.x | Supported | Not supported | Not supported | Not supported |
| Hadoop 1.0.0-1.0.2 | Supported | Supported | Not supported | Not supported |
| Hadoop 1.0.3+ | Supported | Supported | Supported | Not supported |
| Hadoop 1.1.x | Not tested enough | Supported | Supported | Not supported |
| Hadoop 0.23.x | Not supported | Supported | Not tested enough | Not supported |
| Hadoop 2.0.x-alpha | Not supported | Not tested enough | Not supported | Not supported |
| Hadoop 2.1.0-beta | Not supported | Not tested enough | Supported | Not supported |
| Hadoop 2.2.0 | Not supported | Not tested enough | Supported | Supported |
| Hadoop 2.x | Not supported | Not tested enough | Supported | Supported |
We can always visit https://hbase.apache.org for more updated version compatibility between HBase and Hadoop.