Choosing the right Hadoop distribution

We have seen the evolution of Hadoop from a simple lab experiment tool to one of the most famous projects of Apache Software Foundation in the previous section. When the evolution started, many commercial implementations of Hadoop spawned. Today, we see more than 10 different implementations that exist in the market (Source). There is a debate about whether to go with full open source-based Hadoop or with a commercial Hadoop implementation. Each approach has its pros and cons. Let's look at the open source approach.

Pros of open source-based Hadoop include the following:

With a complete open source approach, you can take full advantage of community releases.
It's easier and faster to reach customers due to software being free. It also reduces the initial cost of investment.
Open source Hadoop supports open standards, making it easy to integrate with any system.

Cons of open source-based Hadoop include the following:

In the complete open source Hadoop scenario, it takes longer to build implementations compared to commercial software, due to lack of handy tools that speed up implementation
Supporting customers and fixing issues can become a tedious job due to the chaotic nature of the open source community
The roadmap of the product cannot be controlled/ginfluenced based on business needs

Given these challenges, many times, companies prefer to go with commercial implementations of Apache Hadoop. We will cover some of the key Hadoop distributions in this section.

Cloudera Hadoop distribution

Cloudera is well known and one of the oldest big data implementation players in the market. They have done first commercial releases of Hadoop in the past. Along with a Hadoop core distribution called CDH, Cloudera today provides many innovative tools such as proprietary Cloudera Manager to administer, monitor, and manage the Cloudera platform; Cloudera Director to easily deploy Cloudera clusters across the cloud; Cloudera Data Science Workbench to analyze large data and create statistical models out of it; and Cloudera Navigator to provide governance on the Cloudera platform. Besides ready-to-use products, it also provides services such as training and support. Cloudera follows separate versioning for its CDH; the latest CDH (5.14) uses Apache Hadoop 2.6.

Pros of Cloudera include the following:

Cloudera comes with many tools that can help speed up the overall cluster creation process
Cloudera-based Hadoop distribution is one of the most mature implementations of Hadoop so far
The Cloudera User Interface and features such as the dashboard management and wizard-based deployment offer an excellent support system while implementing and monitoring Hadoop clusters
Cloudera is focusing beyond Hadoop; it has brought in a new era of enterprise data hubs, along with many other tools that can handle much more complex business scenarios instead of just focusing on Hadoop distributions

Cons of Cloudera include the following:

Cloudera distribution is not completely open source; there are proprietary components that require users to use commercial licenses. Cloudera offers a limited 60-day trial license.

Hortonworks Hadoop distribution

Hortonworks, although late in the game (founded in 2011), has quickly emerged as a leading vendor in the big data market. Hortonworks was started by Yahoo engineers. The biggest differentiator between Hortonworks and other Hadoop distributions is that Hortonworks is the only commercial vendor to offer its enterprise Hadoop distribution completely free and 100% open source. Unlike Cloudera, Hortonworks focuses on embedding Hadoop in existing data platforms. Hortonworks has two major product releases. Hortonworks Data Platform (HDP) provides an enterprise-grade open source Apache Hadoop distribution, while Hortonworks Data Flow (HDF) provides the only end-to-end platform that collects, curates, analyzes, and acts on data in real time and on-premises or in the cloud, with a drag-and-drop visual interface. In addition to products, Hortonworks also provides services such as training, consultancy, and support through its partner network. Now, let's look at its pros and cons.

Pros of the Hortonworks Hadoop distribution include the following:

100% open source-based enterprise Hadoop implementation with commercial license need
Hortonworks provides additional open source-based tools to monitor and administer clusters

Cons of the Hortonworks Hadoop distribution include the following:

As a business strategy, Hortonworks has focused on developing the platform layer so, for customers planning to utilize Hortonworks clusters, the cost to build capabilities is higher

MapR Hadoop distribution

MapR is one of the initial companies that started working on their own Hadoop distribution. When it comes to a Hadoop distribution, MapR has gone one step further and replaced HDFS of Hadoop with its own proprietary filesystem called MapRFS. MapRFS is a filesystem that supports enterprise-grade features such as better data management, fault tolerance, and ease of use. One key differentiator between HDFS and MapRFS is that MapRFS allows random writes on its filesystem. Additionally, unlike HDFS, it can be mounted locally through NFS to any filesystem. MapR implements POSIX (HDFS has POSIX-like implementation), so any Linux developer can apply their knowledge to run different commands seamlessly. MapR-like filesystems can be utilized for OLTP-like business requirements due to its unique features.

Pros of the MapR Hadoop distribution include the following:

It's the only Hadoop distribution without Java dependencies (as MapR is based on C)
Offers excellent and production-ready Hadoop clusters
MapRFS is easy to use and it provides multi-node FS access on a local NFS mounted

Cons of the MapR Hadoop distribution include the following:

It gets more and more proprietary instead of open source. Many companies are looking for vendor-free development, so MapR does not fit there.

Each of the distributions, including open source, that we covered have unique business strategy and features. Choosing the right Hadoop distribution for a problem is driven by multiple factors such as the following:

What kind of application needs to be addressed by Hadoop
The type of application—transactional or analytical—and what are the key data processing requirements
Investments and the timeline of project implementation
Support and training requirements of a given project