Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Learning Hbase
Learning Hbase

Learning Hbase: Learn the fundamentals of HBase administration and development with the help of real-time scenarios

Arrow left icon
Profile Icon Shriparv
Arrow right icon
Can$12.99 Can$49.99
Full star icon Full star icon Full star icon Full star icon Half star icon 4.1 (8 Ratings)
eBook Nov 2014 326 pages 1st Edition
eBook
Can$12.99 Can$49.99
Paperback
Can$61.99
Subscription
Free Trial
Arrow left icon
Profile Icon Shriparv
Arrow right icon
Can$12.99 Can$49.99
Full star icon Full star icon Full star icon Full star icon Half star icon 4.1 (8 Ratings)
eBook Nov 2014 326 pages 1st Edition
eBook
Can$12.99 Can$49.99
Paperback
Can$61.99
Subscription
Free Trial
eBook
Can$12.99 Can$49.99
Paperback
Can$61.99
Subscription
Free Trial

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Table of content icon View table of contents Preview book icon Preview Book

Learning Hbase

Chapter 1. Understanding the HBase Ecosystem

HBase is a horizontally scalable, distributed, open source, and a sorted map database. It runs on top of Hadoop file system that is Hadoop Distributed File System (HDFS). HBase is a NoSQL nonrelational database that doesn't always require a predefined schema. It can be seen as a scaling flexible, multidimensional spreadsheet where any structure of data is fit with on-the-fly addition of new column fields, and fined column structure before data can be inserted or queried. In other words, HBase is a column-based database that runs on top of Hadoop distributed file system and supports features such as linear scalability (scale out), automatic failover, automatic sharding, and more flexible schema.

HBase is modeled on Google BigTable. It was inspired by Google BigTable, which is compressed, high-performance, proprietary data store built on the Google file system. HBase was a developed as a Hadoop subproject to support storage of structural data, which can take advantage of most distributed files systems (typically, the Hadoop Distributed File System known as HDFS).

The following table contains key information about HBase and its features:

Features

Description

Developed by

Apache

Written in

Java

Type

Column oriented

License

Apache License

Lacking features of relational databases

SQL support, relations, primary, foreign, and unique key constraints, normalization

Website

http://hbase.apache.org

Distributions

Apache, Cloudera

Download link

http://mirrors.advancedhosters.com/apache/hbase/

Mailing lists

Blog

http://blogs.apache.org/hbase/

HBase layout on top of Hadoop

The following figure represents the layout information of HBase on top of Hadoop:

HBase layout on top of Hadoop

There is more than one ZooKeeper in the setup, which provides high availability of master status; a RegionServer may contain multiple rations. The RegionServers run on the machines where DataNodes run. There can be as many RegionServers as DataNodes. RegionServers can have multiple HRegions; one HRegion can have one HLog and multiple HFiles with its associate's MemStore.

HBase can be seen as a master-slave database where the master is called HMaster, which is responsible for coordination between client application and HRegionServer. It is also responsible for monitoring and recording metadata changes and management. Slaves are called HRegionServers, which serve the actual tables in form of regions. These regions are the basic building blocks of the HBase tables, which contain distribution of tables. So, HMaster and RegionServer work in coordination to serve the HBase tables and HBase cluster.

Usually, HMaster is co-hosted with Hadoop NameNode daemon process on a server and communicates to DataNode daemon for reading and writing data on HDFS. The RegionServer runs or is co-hosted on the Hadoop DataNodes.

Comparing architectural differences between RDBMs and HBase

Let's list the major differences between relational databases and HBase:

Relational databases

HBase

Uses tables as databases

Uses regions as databases

File systems supported are FAT, NTFS, and EXT

File system supported is HDFS

The technique used to store logs is commit logs

The technique used to store logs is Write-Ahead Logs (WAL)

The reference system used is coordinate system

The reference system used is ZooKeeper

Uses the primary key

Uses the row key

Partitioning is supported

Sharding is supported

Use of rows, columns, and cells

Use of rows, column families, columns, and cells

HBase features

Let's see the major features of HBase that make it one of the most useful databases for the current and future industry:

  • Automatic failover and load balancing: HBase runs on top of HDFS, which is internally distributed and automatically recovered using multiple block allocation and replications. It works with multiple HMasters and region servers. This failover is also facilitated using HBase and RegionServer replication.
  • Automatic sharding: An HBase table is made up of regions that are hosted by RegionServers and these regions are distributed throughout the RegionServers on different DataNodes. HBase provides automatic and manual splitting of these regions to smaller subregions, once it reaches a threshold size to reduce I/O time and overhead.
  • Hadoop/HDFS integration: It's important to note that HBase can run on top of other file systems as well. While HDFS is the most common choice as it supports data distribution and high availability using distributed Hadoop, for which we just need to set some configuration parameters and enable HBase to communicate to Hadoop, an out-of-the-box underlying distribution is provided by HDFS.
  • Real-time, random big data access: HBase uses log-structured merge-tree (LSM-tree) as data storage architecture internally, which merges smaller files to larger files periodically to reduce disk seeks.
  • MapReduce: HBase has a built-in support of Hadoop MapReduce framework for fast and parallel processing of data stored in HBase.

    Note

    You can search for the package org.apache.hadoop.hbase.mapreduce for more details.

  • Java API for client access: HBase has a solid Java API support (client/server) for easy development and programming.
  • Thrift and a RESTtful web service: HBase not only provides a thrift and RESTful gateway but also web service gateways for integrating and accessing HBase besides Java code (HBase Java APIs) for accessing and working with HBase.
  • Support for exporting metrics via the Hadoop metrics subsystem: HBase provides Java Management Extensions (JMX) and exporting matrix for monitoring purposes with tools such as Ganglia and Nagios.
  • Distributed: HBase works when used with HDFS. It provides coordination with Hadoop so that distribution of tables, high availability, and consistency is supported by it.
  • Linear scalability (scale out): Scaling of HBase is not scale up but scale out, which means that we don't need to make servers more powerful but we add more machines to its cluster. We can add more nodes to the cluster on the fly. As soon as a new RegionServer node is up, the cluster can begin rebalancing, start the RegionServer on the new node, and it is scaled up, it is as simple as that.
  • Column oriented: HBase stores each column separately in contrast with most of the relational databases, which uses stores or are row-based storage. So in HBase, columns are stored contiguously and not the rows. More about row- and column-oriented databases will follow.
  • HBase shell support: HBase provides a command-line tool to interact with HBase and perform simple operations such as creating tables, adding data, and scanning data. This also provides full-fledged command-line tool using which we can interact with HBase and perform operations such as creating table, adding data, removing data, and a few other administrative commands.
  • Sparse, multidimensional, sorted map database: HBase is a sparse, multidimensional, sorted map-based database, which supports multiple versions of the same record.
  • Snapshot support: HBase supports taking snapshots of metadata for getting the previous or correct state form of data.

HBase in the Hadoop ecosystem

Let's see where HBase sits in the Hadoop ecosystem. In the Hadoop ecosystem, HBase provides a persistent, structured, schema-based data store. The following figure illustrates the Hadoop ecosystem:

HBase in the Hadoop ecosystem

HBase can work as a separate entity on the local file system (which is not really effective as no distribution is provided) as well as in coordination with Hadoop as a separate but connected entity. As we know, Hadoop provides two services, a distributed files system (HDFS) for storage and a MapReduce framework for processing in a parallel mode. When there was a need to store structured data (data in the form of tables, rows and columns), which most of the programmers are already familiar with, the programmers were finding it difficult to process the data that was stored on HDFS as an unstructured flat file format. This led to the evolution of HBase, which provided a way to store data in a structural way.

Consider that we have got a CSV file stored on HDFS and we need to query from it. We would need to write a Java code for this, which wouldn't be a good option. It would be better if we could specify the data key and fetch the data from that file. So, what we can do here is create a schema or table with the same structure of CSV file to store the data of the CSV file in the HBase table and query using HBase APIs, or HBase shell using key.

Data representation in HBase

Let's look into the representation of rows and columns in HBase table:

Data representation in HBase

An HBase table is divided into rows, column families, columns, and cells. Row keys are unique keys to identify a row, column families are groups of columns, columns are fields of the table, and the cell contains the actual value or the data.

So, we have been through the introduction of HBase; now, let's see what Hadoop and its components are in brief. It is assumed here that you are already familiar with Hadoop; if not, following a brief introduction about Hadoop will help you to understand it.

Hadoop

Hadoop is an underlying technology of HBase, providing high availability, fault tolerance, and distribution. It is an Apache-sponsored, free, open source, Java-based programming framework which supports large dataset storage. It provides distributed file system and MapReduce, which is a distributed programming framework. It provides a scalable, reliable, distributed storage and development environment. Hadoop makes it possible to run applications on a system with tens to tens of thousands of nodes. The underlying distributed file system provides large-scale storage, rapid data access. It has the following submodules:

  • Hadoop Common: This is the core component that supports the other Hadoop modules. It is like the master components facilitating communication and coordination between different Hadoop modules.
  • Hadoop distributed file system: This is the underlying distributed file system, which is abstracted on the top of the local file system that provides high throughput of read and write operations of data on Hadoop.
  • Hadoop YARN: This is the new framework that is shipped with newer releases of Hadoop. It provides job scheduling and job and resource management.
  • Hadoop MapReduce: This is the Hadoop-based processing system that provides parallel processing of large data and datasets.

Other Hadoop subprojects are HBase, Hive, Ambari, Avro, Cassandra (Cassandra isn't a Hadoop subproject, it's a related project; they solve similar problems in different ways), Mahout, Pig, Spark, ZooKeeper (ZooKeeper isn't a Hadoop subproject. It's a dependency shared by many distributed systems), and so on. All of these have different usability and the combination of all these subprojects forms the Hadoop ecosystem.

Core daemons of Hadoop

The following are the core daemons of Hadoop:

  • NameNode: This stores and manages all metadata about the data present on the cluster, so it is the single point of contact to Hadoop. In the new release of Hadoop, we have an option of more than one NameNode for high availability.
  • JobTracker: This runs on the NameNode and performs the MapReduce of the jobs submitted to the cluster.
  • SecondaryNameNode: This maintains the backup of metadata present on the NameNode, and also records the file system changes.
  • DataNode: This will contain the actual data.
  • TaskTracker: This will perform tasks on the local data assigned by the JobTracker.

The preceding are the daemons in the case of Hadoop v1 or earlier. In newer versions of Hadoop, we have ResourceManager instead of JobTracker, the node manager instead of TaskTrackers, and the YARN framework instead of a simple MapReduce framework. The following is the comparison between daemons in Hadoop 1 and Hadoop 2:

Hadoop 1

Hadoop 2

HDFS

  • NameNode
  • Secondary NameNode
  • DataNode
  • NameNode (more than one active/standby)
  • Checkpoint node
  • DataNode

Processing

  • MapReduce v1
  • JobTracker
  • TaskTracker
  • YARN (MRv2)
  • ResourceManager
  • NodeManager
  • Application Master

Comparing HBase with Hadoop

As we now know what HBase and what Hadoop are, let's have a comparison between HDFS and HBase for better understanding:

Comparing HBase with Hadoop

Hadoop/HDFS

HBase

This provide file system for distributed storage

This provides tabular column-oriented data storage

This is optimized for storage of huge-sized files with no random read/write of these files

This is optimized for tabular data with random read/write facility

This uses flat files

This uses key-value pairs of data

The data model is not flexible

Provides a flexible data model

This uses file system and processing framework

This uses tabular storage with built-in Hadoop MapReduce support

This is mostly optimized for write-once read-many

This is optimized for both read/write many

Comparing functional differences between RDBMs and HBase

Lately, we are hearing about NoSQL databases such as HBase, so let's just understand what actually HBase has and lacks in comparison to conventional relational databases that have existed for so long now. The following table differentiates it well:

Relational database

HBase

This supports scale up. In other words, when more disk and memory processing power is needed, we need to upgrade it to a more powerful server.

This supports scale out. In other words, when more disk and memory processing power is needed, we need not upgrade the server. However, we need to add new servers to the cluster.

This uses SQL queries for reading records from tables.

This uses APIs and MapReduce for accessing data from HBase tables.

This is row oriented, that is, each row is a contiguous unit of page.

This is column oriented, that is, each column is a contiguous unit of page.

The amount of data depends on configuration of server.

The amount of data does not depend on the particular machine but the number of machines.

It's Schema is more restrictive.

Its schema is flexible and less restrictive.

This has ACID support.

There is no built-in support for HBase.

This is suited for structured data.

This is suited to both structured and nonstructural data.

Conventional relational database is mostly centralized.

This is always distributed.

This mostly guarantees transaction integrity.

There is no transaction guaranty in HBase.

This supports JOINs.

This does not support JOINs.

This supports referential integrity.

There is no in-built support for referential integrity.

So with these differences, both have their own usability and use cases. When we have a small amount of data that can be accommodated in RDBMS without performance lagging, we can go with RDBMS.

When we need more Online Transaction Processing (OLTP) and the transaction type of processing, RDBMS is easy to go. When we have a huge amount of data (in terabytes and petabytes), we should look towards HBase, which is always better for aggregation on columns and faster processing.

We have gone through the word, column-oriented database, in the previous introduction; now let's discuss the difference between the column-oriented databases and the row-oriented databases, which are the traditional relational databases.

These column-oriented database systems have been shown to perform more than an order of magnitude, better than traditional row-oriented database systems on analytical workloads found in data warehouse systems, decision system, and business intelligence applications. These are more I/O-efficient for write-once read-many queries.

Logical view of row-oriented databases

The following figure shows how data is represented in relational databases:

Logical view of row-oriented databases

Logical view of column-oriented databases

The following figure shows how logically we can represent NoSQL/column-oriented databases such as HBase:

Logical view of column-oriented databases

Row-oriented data stores store rows in a contiguous unit on the page, and the number of rows are packed into a page. They are much faster for small numbers of rows and slow for aggregation. On the contrary, column-oriented data stores columns in a contiguous unit on the page, columns may extend up to millions of entries, so they run for many pages. These are much faster for aggregation and analytics. The root of column-oriented database systems can be traced to the 1970 when transposed file first appeared. Column-oriented data stores are better for compression than row-oriented data stores. The following is the comparison between these two:

Row-oriented data stores

Column-oriented data stores

These are efficient for addition/modification of records

These are efficient for reading data

They read pages containing entire rows

They read only needed columns

These are best for OLTP

These are not so optimized for OLTP yet

This serializes all the values in a row together, then the value in the next row, and so on

This serializes all the value of columns together and so on

Row data are stored in contiguous pages in memory or on disk

Columns are stored in pages in memory or on disk

Suppose the records of a table are stored in the pages of memory. When they need to be accessed, these pages are brought to the primary memory, if they are not already present in the memory.

If one row occupies a page and we need all specific column such as salary or rate of interest from all the rows for some kind of analytics, each page containing the columns has to be brought in the memory; so this page in page out will result in a lot of I/O, which may result in prolonged processing time.

In column-oriented databases, each column will be stored in pages. If we need to fetch a specific column, there will be less I/O as only the pages that contain the specified column needed to be brought in the main memory and read, and we need not bring and read all the pages containing rows/records henceforth into the memory. So the kind of queries where we need to just fetch specific columns and not whole record(s) or sets is served best in column-oriented database, which is useful for analytics wherein we can fetch some columns and do some mathematical operations such as sum and average.

Pros and cons of column-oriented databases

The following are pros of column-oriented database:

  • This has built-in support for efficient and data compression.
  • This supports fast data retrieval.
  • Administration and configuration is simplified. It can be scaled out and hence is very easy to expand.
  • This is good for high performance on aggregation queries (such as COUNT, SUM, AVG, MIN, and MAX).
  • This is efficient for partitioning as it provides features of automatic sharding mechanism to distribute bigger regions to smaller ones.

The following are cons of column-oriented database:

  • Queries with JOINs and data from many tables are not optimized.
  • It records and deletes lot of updates and has to make frequent compaction and splits too. This reduces its storage efficiency.
  • Partitioning or indexing schemes can be difficult to design as a relational concept is not implicit.

About the internal storage architecture of HBase

The following figure shows the principle algorithm and data structure HBase works on, that is, LSM-tree, and the way of merging, and precedes the explanation:

About the internal storage architecture of HBase

HBase stores file using LSM-tree, which maintains data in two separate parts that are optimized for underlying storage. This type of data structure depends on two structures, a current and smaller one in memory and a bigger one on the persistent disk, and once the part in memory becomes bigger than a certain limit, it is merged with the bigger structure that is stored on the disk using a merge sort algorithm and a new in-memory tree is created for newer insert requests. It transforms random data access into sequential data access, which improves read performance, and merging is a background process, which does not affect the foreground processing.

Getting started with HBase

We will discuss this section in a bit questionnaire manner, and will come to understand HBase with the help of scenarios and conditions.

When it started

The following figure shows the flow of HBase: birth, growth, and current status:

When it started

It all started in 2006 when a company called Powerset (later acquired by Microsoft) was looking forward to building a natural language search engine. The person responsible for the development was Jim Kellerman and there were also many other contributors. It was modeled around the Google BigTable white paper that came out in 2006, which was running on Google File System (GFS).

It started with a TAR file with a random bunch of Java files with initial HBase code. It was first added to the contrib directory of Hadoop as a small subcomponent of Hadoop and with the dedicated effort for filling up gaps; it has slowly and steadily grown into a full-fledged project. It was first added with Hadoop 0.1.0 and as it become more and more feature rich and stable, it was promoted to Hadoop subproject and then slowly with more and more development and contribution from the HBase user and developer group, it has became one of the top-level projects at Apache.

The following figure shows HBase versions from the beginning till now:

When it started

Let's have a look at the year-by-year evolution of HBase's important features:

  • 2006: The idea of HBase started with the white paper of Google BigTable
  • 2007: Data compression on the per column family was made available, addition and deletion of column families online was added, script to start/stop HBase cluster was added, MapReduce connecter was added, HBase shell support was added, support of row and column filter was added, algorithm to distribute region was evenly added, first rest interface was added, and hosted the first HBase meetup
  • 2008: HBase 0.1.0 to 0.18.1, HBase moved to new SVN, HBase added as a contribution to Hadoop, HBase become Hadoop subproject, first separate release became available, and Ruby shell added
  • 2009: HBase 0.19.0 to 0.20.*, improvement in writing and scanning, batching of writing and scanning, block compression, HTable interface to REST, addition of binary comparator, addition of regular expression filters, and many more
  • 2010 till date: HBase 0.89.* - 0.94.*, support for HDFS durability, improvement in import flow, support for MurmurHash3, addition of daemon threads for NameNode and DataNode to indicate the VM or kernel-caused pause in application log, tags support for key value, running MapReduce over snapshot files, introduction of transparent encryption of HBase on disk data, addition of per key-value security, offline rebuilding .META. from file system data, snapshot support, and many more

    Note

    For more information, just visit https://issues.apache.org/jira/browse/HBASE and explore more detailed explanations and more lists of improvements and the addition of new features.

  • 0.96 to 1.0 and Future: HBase Version 1 and higher, add utility for adorning HTTP context, fully high availability with Hadoop HA, rolling upgrades, improved failure detection and recovery, cell-level access security, inline cell tagging, quota and grouping, reverse scan, rolling upgrade, and it will be more useful for analytics purposes and helpful for data scientists

    Note

    While we wait for new features in v1.0; we can always visit http://hbase.apache.org for the latest releases and features.

    Here is the link from where we can download HBase versions:

    http://apache.mirrors.tds.net/hbase/stable

Let's now discuss HBase and Hadoop compatibility and the features they provide together.

Prior to Hadoop v1.0, when DataNode used to crash, HBase Write-Ahead Log—the logfiles that maintain the read/write operation before the final writing is done to the MemStore—would be lost and hence the data too. This version of Hadoop integrated append branch into the main core, which increased the durability for HBase. Hadoop v1.0 has also implemented the facility of disk failure, making RegionServer more robust.

Hadoop v2.0 has integrated high availability of NameNode, which also enables HBase to be more reliable and robust by enabling the multiple HMaster instances. Now with this version of HBase, upgrading has become easy because it is made independent of HDFS upgrades. Let's see in the following table how recent versions of Hadoop have enhanced HBase on the basis of performance, availability, and features:

Criteria

Hadoop v1.0

Hadoop v2.0

Hadoop v2.x

Features

Durability using hflush()

Performance

Short-circuit read

  • Native CRC
  • Datanodekeepalive
  • Direct write API (HBase provides the utility classes and the ImportTSV tool itself to write directly into HFile. Then, using the IncrementalLoadHFile, these files are loaded into the regions managed by RS. Once these two steps are over, client can read the data normally)
  • Zero copy API (in this operation, the CPU does not copy data from one memory area to another. This is used to save on processing power and memory when sending files over a network)
  • Direct codec API (used at server side for writing cells to WAL as well as for sending edits as part of the distributed-splitting process)

Note

Miscellaneous features in newer HBase are HBase isolation and allocation, online-automated repair of table integrity and region consistency problems, dynamic configuration changes, reverse scanning (stop row to start row), and many other features; users can visit https://issues.apache.org/jira/browse/HBASE for features and advancement of each HBase release.

HBase components and functionalities

Here let's discuss various components of HBase and their components recursively:

  • ZooKeeper
  • HMaster
  • RegionServer
  • Client
  • Catalog tables

ZooKeeper

ZooKeeper is a high-performance, centralized, multicoordination service system for distributed application, which provides a distributed synchronization and group service to HBase.

It enables the users and developer to focus on the application logic and not on the coordination with the cluster, for which it provides some API that can be used by the developers to use and implement coordination task such as master server, and managing application and cluster communication system.

In HBase, ZooKeeper is used to elect a cluster master in order to keep track of available and online servers, and to keep the metadata of the cluster. ZooKeeper APIs provide:

  • Consistency, ordering, and durability
  • Synchronization
  • Concurrency for a distributed clustered system

The following figure shows ZooKeeper:

ZooKeeper

It was developed at Yahoo Research. And the reason behind the name ZooKeeper is that in Hadoop system, projects are based on animal names, and in discussion regarding naming this technology, this name emerged as it manages the availability and coordination between different components of a distributed system.

ZooKeeper not only simplifies the development but also sits on the top distributed system as an abstraction layer to facilitate the better reachability to the components of the system. The following figure shows the request and response flow:

ZooKeeper

Let's consider a scenario wherein we have a few people who want to fill 10 rooms with some items. One instance would be where we will show how they find their way to the room to keep the items. Some of the rooms will be locked, which will lead the people to move on to other rooms. The other instance would be where we can allocate some representatives with information about the rooms, condition of rooms, and state of rooms (open, closed, fit for storing, not fit, and so on). We can then send them with items to those representatives for the information. The representative will guide the person towards the right room, which is available for storage of items, and the person can directly move to the specified room and store the item. This will not only ease the communication and the storage process but also reduce the overhead from the process. The same technique can be applied in the case of the ZooKeepers.

ZooKeeper maintains a tree with ZooKeeper data internally called a znode. This can be of two types:

  1. Ephemeral, which is good for applications that need to understand whether a specific distributed resource is available or not.
  2. The persistent one will be stored till a client does not delete it explicitly and it stores some data of the application too.

Why an odd number of ZooKeepers?

ZooKeepers are based on a majority principle; it requires that we have a quorum of servers to be up, where quorum is ceil(n/2), for a cluster of three nodes ensemble means two nodes must be up and running at any point of time, and for five node ensemble, a minimum three nodes must be up. It's also important for election purpose for the ZooKeeper master. We will discuss more options of configuration and coding of ZooKeeper in later chapters.

HMaster

HMaster is the component of the HBase cluster that can be thought of as NameNode in the case of Hadoop cluster; likewise, it acts as a master for RegionServers running on different machines. It is responsible for monitoring all RegionServers in an HBase cluster and also provides an interface to all HBase metadata for the client operations. It also handles RegionServer failover, and region splits.

There may be more than one instance of HMaster in an HBase cluster that provides High Availability (HA). So, if we have more than one master, only one master is active at a time; at the start up time, all the masters compete to become the active master in the cluster and whichever wins becomes the active master of the cluster. Meanwhile, all other master instances remain passive till the active master crashes, shuts down, or loses a lease from the ZooKeeper.

In short, it is a coordination component in an HBase cluster, which also manages and enables us to perform an administrative task on the cluster.

Let's now discuss the flow of starting up the HMaster process:

  1. Block (do not serve requests) until it becomes active HMaster.
  2. Finish initialization.
  3. Enter loop until stopped.
  4. Do cleansing when it is stopped.

HMaster exports some of the following interfaces that are metadata-based methods to enable us to interact with HBase:

Related to

Facilities

HBase tables

Creating table, deleting table, enabling/disabling table, and modifying table

HBase column families

Adding columns, modifying columns, and removing columns

HBase table regions

Moving regions, assigning regions, and unassign regions

In HBase, there is a table called .META. (table name on file system), which keeps all information about regions that is referred by HMaster for information about the data. By default, HMaster runs on port number 60000 and its HTTP Web UI is available on port 60010, which can always be changed according to our need.

HMaster functionalities can be summarized as follows:

  • Monitors RegionServers
  • Handles RegionServers failover
  • Handles metadata changes
  • Assignment/unassignment of regions
  • Interfaces all metadata changes
  • Performs reload balancing in idle time
  • It publishes its location to client using ZooKeeper
  • HMaster Web UI provides all information about HBase cluster (table, regions, RegionServers and so on)

If a master node goes down

If master goes down, in this scenario, the cluster may continue working normally as clients talk directly to RegionServers. So, cluster may still function steadily. The HBase catalog table (.META. and -ROOT-) exists as HBase tables and it's not stored in master resistant memory. However, as master performs critical functions such as RegionServers' failovers and region splits, these functions may be hampered and if not taken care will create a huge setback to the overall cluster functioning, so the master must be started as soon as possible.

So now, Hadoop is HA enabled and thus HBase can always be made HA using multiple HMasters for better availability and robustness, so we can now consider having multiple HMaster.

RegionServer

RegionServers are responsible for holding the actual raw HBase data. Recall that in a Hadoop cluster, a NameNode manages the metadata and a DataNode holds the raw data. Likewise, in HBase, an HBase master holds the metadata and RegionServer's store. These are the servers that hold the HBase data, as we may already know that in Hadoop cluster, NameNode manages the metadata and DataNode holds the actual data. Likewise, in HBase cluster, RegionServers store the raw actual data. As you might guess, a RegionServer is run or is hosted on top of a DataNode, which utilizes the underlying DataNodes at underlying file system, that is, HDFS.

The following figure shows the architecture of RegionServer:

RegionServer

RegionServer performs the following tasks:

  • Serving regions(tables) assigned to it
  • Handling client read/write requests
  • Flushing cache to HDFS
  • Maintaining HLogs
  • Performing compactions
  • Responsible for handling region splits

Components of a RegionServer

The following are the components of RegionServers

  • Write-Ahead logs: This is also called edit. When data is read/modified to HBase, it's not directly written in the disk rather it is kept in memory for some time (threshold, which we can configure based on size and time). Keeping this data in memory may result in a loss of data if the machine goes down abruptly. So to solve this, the data is first written in an intermediate file, which is called Write-Ahead logfile and then in memory. So in the case of system failure, data can be reconstructed using this logfile.
  • HFile: These are the actual files where the raw data is stored physically on the disk. This is the actual store file.
  • Store: Here the HFile is stored. It corresponds to a column family for a table in HBase.
  • MemStore: This component is in memory data store; this resides in the main memory and records the current data operation. So, when data is stored in WAL, RegionServers stores key-value in memory store.
  • Region: These are the splits of HBase table; the table is divided into regions based on the key and are hosted by RegionServers. There may be different regions in a RegionServer.

We will discuss more about these components in the next chapter.

Client

Client is responsible for finding the RegionServer, which is hosting the particular row (data). It is done by querying the catalog tables. Once region is found, the client directly contacts RegionServers and performs the data operation. Once this information is fetched, it is cached by the client for further fast retrieval. The client can be written in Java or any other language using external APIs.

Catalog tables

There are two tables that maintain the information about all RegionServers and regions. This is a kind of metadata for the HBase cluster. The following are the two catalog tables that exist in HBase:

  • -ROOT-: This includes information about the location of .META. table
  • .META.: This table holds all regions and their locations

At the beginning of the start up process, the .mMeta location is set to root from where the actual metadata of tables are read and read/write continues. So, whenever a client wants to connect to HBase and read or write into table, these two tables are referred and information is returned to client for direct read and write to the RegionServers and the regions of the specific table.

Who is using HBase and why?

The following is a list of just a few companies that use HBase in production. There are many companies who are using HBase, so we will list a few and not all.

  • Adobe: They have an HBase cluster of 30 nodes and are ready to expand it. They use HBase in several areas from social services to structured data and processing for internal use.
  • Facebook: They use it for messaging infrastructure.
  • Twitter: They use it for a number of applications including people search, which relies on HBase internally for data generation, and also their operations team uses HBase as a time series database for cluster-wide monitoring/performance data.
  • Infolinks: They use it for process advertisement selection and user events for our in-text advertising network.
  • StumbleUpon: They use it with MapReduce data source to overcome traditional query speed limits in MySQL.
  • Trend Micro: They use it as cloud-based storage.
  • Yahoo!: They are use HBase to store document fingerprint for detecting near-duplicates. They have a cluster of a few nodes that run HDFS, MapReduce, and HBase. The table contains millions of rows; we use this for querying duplicated documents with real-time traffic.
  • Ancestry.com: This company uses it for DNA analysis.
  • UIDAI: This is an Indian government project; they use HBase for storing resident details.
  • Apache: They use it for maintaining wiki.
  • Mozilla: They are moving Socorro project to HBase.
  • eBay: They use HBase for indexing site inventory.

Note

And we can keep listing, but we will stop it here and for further information, please visit http://wiki.apache.org/hadoop/Hbase/PoweredBy.

When should we think of using HBase?

Using HBase is not the solution to all problems; however, it can solve a lot of problems efficiently. The first thing is that we should think about the amount of data; if we have a few million rows and a few read and writes, then we can avoid using it. However, think of billions of columns and thousands of read/write data operations in a short interval, we can surely think of using HBase.

Let's consider an example, Facebook uses HBase for its real-time messaging infrastructure and we can think of how many messages or rows of data Facebook will be receiving per second. Considering that amount of data and I/O, we can currently think of using HBase. The following list details a few scenarios when we can consider using HBase:

  • If data needs to have a dynamic or variable schema
  • If a number of columns contain more null values (blank columns)
  • When we have a huge number of dynamic rows
  • If our data contains a variable number of columns
  • If we need to maintain versions of data
  • If high scalability is needed
  • If we need in-built compression on records
  • If a high volume of I/O is needed

There are many other cases where we can use HBase and it can be beneficial, which is discussed in later chapters.

When not to use HBase

Let's now discuss some points when we don't compulsorily have to use HBase just because everyone else is using it:

  • When data is not in large amounts (in TBs and more)
  • When JOINs and relational DB features are needed
  • Don't go with the belief "every one is using it"
  • If RDBMS fits your requirements, use RDBMS

Understanding some open source HBase tools

The following is the list of some HBase tools that are available in the development world:

The Hadoop-HBase version compatibility table

As there are compatibility issues in almost all systems; likewise, HBase also has compatibility issues with Hadoop versions, which means all versions of HBase can't be used use on top of all Hadoop versions. The following is the version compatibility of Hadoop-HBase that should be kept in mind while configuring HBase on Hadoop (credit: Apache):

Hadoop versions

HBase 0.92.x

HBase 0.94.x

HBase 0.96.0

HBase 0.98.0

Hadoop 0.20.205

Supported

Not supported

Not supported

Not supported

Hadoop 0.22.x

Supported

Not supported

Not supported

Not supported

Hadoop 1.0.0-1.0.2

Supported

Supported

Not supported

Not supported

Hadoop 1.0.3+

Supported

Supported

Supported

Not supported

Hadoop 1.1.x

Not tested enough

Supported

Supported

Not supported

Hadoop 0.23.x

Not supported

Supported

Not tested enough

Not supported

Hadoop 2.0.x-alpha

Not supported

Not tested enough

Not supported

Not supported

Hadoop 2.1.0-beta

Not supported

Not tested enough

Supported

Not supported

Hadoop 2.2.0

Not supported

Not tested enough

Supported

Supported

Hadoop 2.x

Not supported

Not tested enough

Supported

Supported

We can always visit https://hbase.apache.org for more updated version compatibility between HBase and Hadoop.

Applications of HBase

The applications of HBase are as follows:

  • Medical: HBase is used in the medical field for storing genome sequences and running MapReduce on it, storing the disease history of people or an area, and many others.
  • Sports: HBase is used in the sports field for storing match histories for better analytics and prediction.
  • Web: HBase is used to store user history and preferences for better customer targeting.
  • Oil and petroleum: HBase is used in the oil and petroleum industry to store exploration data for analysis and predict probable places where oil can be found.
  • e-commerce: HBase is used for recording and storing logs about customer search history, and to perform analytics and then target advertisement for better business.
  • Other fields: HBase can be used in many other fields where it's needed to store petabytes of data and run analysis on it, for which traditional systems may take months. We will discuss more about use cases and industry usability in further chapters.

HBase pros and cons

Let's now briefly discuss HBase pros and cons.

The following are some advantages of HBase:

  • Great for analytics in association with Hadoop MapReduce
  • It can handle very large volumes of data
  • Supports scaling out in coordination with Hadoop file system even on commodity hardware
  • Fault tolerance
  • License free
  • Very flexible on schema design/no fixed schema
  • Can be integrated with Hive for SQL-like queries, which is better for DBAs who are more familiar with SQL queries
  • Auto-sharding
  • Auto failover
  • Simple client interface
  • Row-level atomicity, that is, the PUT operation will either write or fail

The following are some missing aspects:

  • Single point of failure (when only one HMaster is used)
  • No transaction support
  • JOINs are handled in MapReduce layer rather than the database itself
  • Indexed and sorted only on key, but RDBMS can be indexed on some arbitrary field
  • No built-in authentication or permissions

So overall, we can say if we are in a position to neglect these cons, we can go with HBase which provides many other benefits that are not there in RDBMS. We can see that it's still an evolving technology with Hadoop and with time, it will become more mature and rich, which will make it one of the best tools for analytical database and distributed fault tolerant database. It is an open source Apache project where users and developers can contribute and add more and more features.

Hadoop HBase and a combination of some other Hadoop subproject can do wonders in the data analysis field; using these technologies, the data can be a hidden treasure, which were stored somewhere uselessly as a dump and now they can be very beneficial for understanding various prospects of a specific industry.

Summary

So in this chapter, we discussed the introductory aspects of HBase and related projects. We have also discussed HBase's components and their place in the HBase ecosystem. This chapter then provided a brief historical context for HBase and we have related it with some common uses of HBase in the industry.

In the next chapter, we will begin with HBase, understanding the different considerations and prerequisites for getting started with HBase. We will also discuss some of the concerns a new user might face during their first encounter with HBase.

Left arrow icon Right arrow icon

Description

If you are an administrator or developer who wants to enter the world of Big Data and BigTables and would like to learn about HBase, this is the book for you.

What you will learn

  • Understand the fundamentals of HBase
  • Understand the prerequisites necessary to get started with HBase
  • Install and configure a new HBase cluster
  • Optimize an HBase cluster using different Hadoop and HBase parameters
  • Make clusters more reliable using different troubleshooting and maintenance techniques
  • Get to grips with the HBase data model and its operations
  • Get to know the benefits of using Hadoop tools/JARs for HBase

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Nov 25, 2014
Length: 326 pages
Edition : 1st
Language : English
ISBN-13 : 9781783985951
Vendor :
Apache
Category :
Languages :
Tools :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Product Details

Publication date : Nov 25, 2014
Length: 326 pages
Edition : 1st
Language : English
ISBN-13 : 9781783985951
Vendor :
Apache
Category :
Languages :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just Can$6 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just Can$6 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total Can$ 181.97
Hbase Essentials
Can$49.99
Mastering Hadoop
Can$69.99
Learning Hbase
Can$61.99
Total Can$ 181.97 Stars icon
Banner background image

Table of Contents

11 Chapters
1. Understanding the HBase Ecosystem Chevron down icon Chevron up icon
2. Let's Begin with HBase Chevron down icon Chevron up icon
3. Let's Start Building It Chevron down icon Chevron up icon
4. Optimizing the HBase/Hadoop Cluster Chevron down icon Chevron up icon
5. The Storage, Structure Layout, and Data Model of HBase Chevron down icon Chevron up icon
6. HBase Cluster Maintenance and Troubleshooting Chevron down icon Chevron up icon
7. Scripting in HBase Chevron down icon Chevron up icon
8. Coding HBase in Java Chevron down icon Chevron up icon
9. Advance Coding in Java for HBase Chevron down icon Chevron up icon
10. HBase Use Cases Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.1
(8 Ratings)
5 star 62.5%
4 star 12.5%
3 star 0%
2 star 25%
1 star 0%
Filter icon Filter
Top Reviews

Filter reviews by




Sören Mar 24, 2015
Full star icon Full star icon Full star icon Full star icon Full star icon 5
For a future project, I had to read up on both Hadoop and HBase. For HBase, I decided to purchase Learning HBase by Shashwat Shriparv. The author starts by introducing HBase and the ecosystem and later guides the reader through installation and configuration (of both the Apache and Cloudera flavours). Of course, the coding aspects such as scripting and Java development are not forgotten and there is a separate chapter on maintenance and troubleshooting, which is very handy to have. In my case, the decision for HBase had already been made, but the uses cases in the last chapter were nonetheless very interesting. The book is written in an accessible style with enough detail and coverage to get started with HBase. I am glad I bought it and can definitely recommend it as a starting point for the HBase beginner.
Amazon Verified review Amazon
D Mar 27, 2015
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Learning HBASE book contains everything a beginner needs to get started with HBASE. The book provides the reader basic understanding of HBASE concepts as well as Hadoop and Zookeeper. The author does a nice job of walking through the reader with installing, running, using, and maintaining HBase.
Amazon Verified review Amazon
Phani Mutyala Mar 24, 2015
Full star icon Full star icon Full star icon Full star icon Full star icon 5
I am in love with this book. The concept, the way the explanation has been give was solid. The best part I liked was the example that have been provided throughout the book and for every use case that was described. My Learning path on NoSQL got changed after my readings. I would certainly recommend this book to the audience that are beginners to NoSQL word.
Amazon Verified review Amazon
Natalia Mar 20, 2015
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book has been a great source for me when I was studying for Cloudera HBase Specialist exam earlier this year (which I passed). I would say the audience for this book is people who have been exposed to various Big Data technologies and have good understanding of HDFS. Which was great for me because I did not have to skip sections that are often added to the books as an intro to Hadoop ecosystem. "HBase Use Cases" portion was very informative for me personally. Chapter on Coding HBase in Java covers material that would generally interest development team but I found it very useful for me (my duties can be mostly described as Hadoop administration) because information there is tied to HBase API topics that - as an admin - you should know.
Amazon Verified review Amazon
varun Mar 16, 2015
Full star icon Full star icon Full star icon Full star icon Full star icon 5
I liked this book,this book is very useful for developer and administrator.He has covered all the issue and challenges faced in real time and best practices in implementation and developing of Hbase.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.