This article by Shashwat Shriparv, author of the book, Learning HBase, will introduce you to the world of HBase.
HBase is a horizontally scalable, distributed, open source, sorted map database. It runs on top of the Hadoop file system, that is, the Hadoop Distributed File System (HDFS). HBase is a NoSQL, nonrelational database that doesn't always require a predefined schema. It can be seen as a scalable, flexible, multidimensional spreadsheet: any structure of data fits, new column fields can be added on the fly, and no fixed column structure has to be defined before data can be inserted or queried. In other words, HBase is a column-oriented database that runs on top of the Hadoop Distributed File System and supports features such as linear scalability (scale out), automatic failover, automatic sharding, and a flexible schema.
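As a minimal sketch of what that flexibility looks like in practice (assuming the HBase 1.x Java client; the table notes, family cf, and column newColumn are made-up names), only the column family is declared when the table is created, and a brand-new column comes into existence simply by writing to it:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class OnTheFlyColumns {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            try (Admin admin = connection.getAdmin()) {
                // Only column families are fixed at creation time, not columns.
                HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("notes"));
                desc.addFamily(new HColumnDescriptor("cf"));
                admin.createTable(desc);
            }
            try (Table table = connection.getTable(TableName.valueOf("notes"))) {
                // A new column (qualifier) appears with the write itself;
                // no schema change is needed beforehand.
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("newColumn"),
                              Bytes.toBytes("added on the fly"));
                table.put(put);
            }
        }
    }
}
```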
HBase is modeled on Google BigTable, a compressed, high-performance, proprietary data store built on the Google File System. HBase was developed as a Hadoop subproject to support the storage of structured data, and it can take advantage of most distributed filesystems (typically, the Hadoop Distributed File System, known as HDFS).
The following table contains key information about HBase and its features:
| Features | Description |
| --- | --- |
| Developed by | Apache |
| Written in | Java |
| Type | Column oriented |
| License | Apache License |
| Lacking features of relational databases | SQL support; relations; primary, foreign, and unique key constraints; normalization |
| Website | http://hbase.apache.org |
| Distributions | Apache, Cloudera |
| Download link | |
| Mailing lists | |
| Blog | |
The following figure represents the layout of HBase on top of Hadoop:

There is more than one ZooKeeper node in the setup (a quorum), which provides high availability of the master status; a RegionServer may contain multiple regions. The RegionServers run on the machines where DataNodes run, and there can be as many RegionServers as there are DataNodes. A RegionServer can host multiple HRegions; each RegionServer maintains one HLog (the write-ahead log, shared by its regions), and each HRegion has multiple HFiles with their associated MemStores.

HBase can be seen as a master-slave database, where the master is called HMaster. HMaster is responsible for coordination between client applications and HRegionServers, as well as for monitoring and recording metadata changes and management. The slaves are called HRegionServers, and they serve the actual tables in the form of regions. Regions are the basic building blocks of HBase tables, across which a table's data is distributed. So, the HMaster and the RegionServers work in coordination to serve the HBase tables and the HBase cluster.

Usually, the HMaster is co-hosted with the Hadoop NameNode daemon process on a server and communicates with the DataNode daemons for reading and writing data on HDFS. The RegionServers run, or are co-hosted, on the Hadoop DataNodes.
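To see this layout from a client's point of view, here is a minimal sketch (HBase 1.x client API; it assumes a reachable cluster configured via hbase-site.xml) that asks for the cluster status and prints the HMaster, the RegionServers, and the total region count:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ClusterLayout {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            ClusterStatus status = admin.getClusterStatus();
            System.out.println("HMaster:      " + status.getMaster());
            System.out.println("Region count: " + status.getRegionsCount());
            for (ServerName rs : status.getServers()) { // one entry per live RegionServer
                System.out.println("RegionServer: " + rs);
            }
        }
    }
}
```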
Let's list the major differences between relational databases and HBase:
| Relational databases | HBase |
| --- | --- |
| Uses tables as databases | Uses regions as databases |
| Filesystems supported are FAT, NTFS, and EXT | Filesystem supported is HDFS |
| The technique used to store logs is commit logs | The technique used to store logs is Write-Ahead Logs (WAL) |
| The reference system used is the coordinate system | The reference system used is ZooKeeper |
| Uses the primary key | Uses the row key |
| Partitioning is supported | Sharding is supported |
| Use of rows, columns, and cells | Use of rows, column families, columns, and cells |
Let's see the major features of HBase that make it one of the most useful databases for the current and future industry:

- Linear scalability (scale out) and automatic sharding of tables into regions
- Automatic failover support
- A flexible, on-the-fly schema based on column families
- Built-in support for Hadoop MapReduce jobs over HBase tables

You can look into the org.apache.hadoop.hbase.mapreduce package for more details on the MapReduce integration; a short sketch follows.
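As an illustration of that package (HBase 1.x API assumed; the table name mytable and the row-counting mapper are hypothetical), the following minimal sketch wires a map-only MapReduce job that scans an HBase table and counts its rows:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanTableJob {
    // The mapper receives one table row per call: its row key and Result.
    static class RowMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result columns,
                           Context context) throws IOException, InterruptedException {
            context.getCounter("hbase", "rows").increment(1); // count each row
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-mytable");
        job.setJarByClass(ScanTableJob.class);
        Scan scan = new Scan();
        scan.setCaching(500);       // fetch rows in batches per RPC
        scan.setCacheBlocks(false); // don't pollute the block cache from MapReduce
        TableMapReduceUtil.initTableMapperJob(
                "mytable", scan, RowMapper.class,
                NullWritable.class, NullWritable.class, job);
        job.setOutputFormatClass(NullOutputFormat.class);
        job.setNumReduceTasks(0);   // map-only job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```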
Let's see where HBase sits in the Hadoop ecosystem. In the Hadoop ecosystem, HBase provides a persistent, structured, schema-based data store. The following figure illustrates the Hadoop ecosystem:
HBase can work as a separate entity on the local filesystem (which is not really effective, as no distribution is provided) as well as in coordination with Hadoop, as a separate but connected entity. As we know, Hadoop provides two services: a distributed filesystem (HDFS) for storage and the MapReduce framework for processing in a parallel mode. When there was a need to store structured data (data in the form of tables, rows, and columns), a format most programmers are already familiar with, programmers found it difficult to process data stored on HDFS in an unstructured flat-file format. This led to the evolution of HBase, which provides a way to store data in a structured way.

Consider that we have a CSV file stored on HDFS and we need to query it. We would need to write Java code for this, which wouldn't be a good option. It would be better if we could specify a key for the data and fetch the data by that key. So, what we can do here is create a schema, or table, with the same structure as the CSV file, store the data of the CSV file in that HBase table, and query it by key using the HBase APIs or the HBase shell.
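For instance, assuming the CSV rows were loaded into a table called csv_table under a column family d with a column city (all three names hypothetical), a lookup by row key with the HBase Java client (1.x API) is just a few lines:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class GetByKey {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("csv_table"))) {
            Get get = new Get(Bytes.toBytes("row-0001"));           // the key to fetch
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("d"),      // column family
                                           Bytes.toBytes("city"));  // column
            System.out.println(Bytes.toString(value));
        }
    }
}
```

The same lookup is a one-liner in the HBase shell: get 'csv_table', 'row-0001'.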
Let's look into the representation of rows and columns in an HBase table:
An HBase table is divided into rows, column families, columns, and cells. Row keys are unique keys to identify a row, column families are groups of columns, columns are fields of the table, and the cell contains the actual value or the data.
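The following minimal sketch shows that model in code (HBase 1.x client; the table users and column family info are hypothetical): a Scan walks the rows, and every cell it returns carries its row key, column family, column qualifier, and value:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanCells {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"));
             ResultScanner scanner = table.getScanner(new Scan())) {
            for (Result row : scanner) {               // one Result per row key
                for (Cell cell : row.rawCells()) {     // every cell in that row
                    System.out.printf("row=%s family=%s column=%s value=%s%n",
                        Bytes.toString(CellUtil.cloneRow(cell)),
                        Bytes.toString(CellUtil.cloneFamily(cell)),
                        Bytes.toString(CellUtil.cloneQualifier(cell)),
                        Bytes.toString(CellUtil.cloneValue(cell)));
                }
            }
        }
    }
}
```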
So, we have been through the introduction of HBase; now, let's see what Hadoop and its components are, in brief. It is assumed here that you are already familiar with Hadoop; if not, the following brief introduction to Hadoop will help you understand it.
Hadoop is the underlying technology of HBase, providing high availability, fault tolerance, and distribution. It is an Apache-sponsored, free, open source, Java-based programming framework that supports the storage of large datasets. It provides a distributed filesystem (HDFS) and MapReduce, a distributed programming framework, offering a scalable, reliable, distributed storage and processing environment. Hadoop makes it possible to run applications on systems with tens to tens of thousands of nodes, and the underlying distributed filesystem provides large-scale storage with rapid data access. It has the following submodules:

- Hadoop Common: The common utilities that support the other Hadoop modules
- Hadoop Distributed File System (HDFS): A distributed filesystem that provides high-throughput access to application data
- Hadoop YARN: A framework for job scheduling and cluster resource management (Hadoop 2)
- Hadoop MapReduce: A framework for parallel processing of large datasets
Other Hadoop-related Apache projects are HBase, Hive, Ambari, Avro, Cassandra, Mahout, Pig, Spark, ZooKeeper, and so on (strictly speaking, Cassandra is not a Hadoop subproject but a related project that solves similar problems in different ways, and ZooKeeper is an independent coordination service that many distributed systems depend on). All of these have different uses, and the combination of all these projects forms the Hadoop ecosystem.
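To make the storage side concrete, here is a minimal sketch of writing and then reading a file on HDFS through Hadoop's FileSystem Java API (the path /tmp/hello.txt is made up, and the configuration is assumed to be picked up from core-site.xml on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml
        FileSystem fs = FileSystem.get(conf);           // the configured filesystem (HDFS)
        Path path = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) { // overwrite if present
            out.writeUTF("hello, hdfs");
        }
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}
```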
The following are the core daemons of Hadoop:

- NameNode
- Secondary NameNode
- DataNode
- JobTracker
- TaskTracker

The preceding are the daemons in the case of Hadoop v1 or earlier. In newer versions of Hadoop, we have the ResourceManager instead of the JobTracker, NodeManagers instead of TaskTrackers, and the YARN framework instead of the plain MapReduce framework. The following is a comparison between the daemons in Hadoop 1 and Hadoop 2:
| | Hadoop 1 | Hadoop 2 |
| --- | --- | --- |
| HDFS | NameNode, Secondary NameNode | NameNode (with an active/standby pair for high availability) |
| | DataNode | DataNode |
| Processing | JobTracker | ResourceManager |
| | TaskTracker | NodeManager |
Now that we know what HBase and Hadoop are, let's compare HDFS and HBase for a better understanding:
| Hadoop/HDFS | HBase |
| --- | --- |
| Provides a filesystem for distributed storage | Provides tabular, column-oriented data storage |
| Optimized for the storage of huge files, with no random read/write of those files | Optimized for tabular data with random read/write access |
| Uses flat files | Uses key-value pairs of data |
| The data model is not flexible | Provides a flexible data model |
| Uses a filesystem and a processing framework | Uses tabular storage with built-in Hadoop MapReduce support |
| Mostly optimized for write-once, read-many workloads | Optimized for both read-many and write-many workloads |
So, in this article, we discussed the introductory aspects of HBase and its features. We also discussed HBase's components and their place in the Hadoop ecosystem.