Apache Hadoop 3 Quick Start Guide: Learn about big data processing and analytics

Product type: Paperback
Published: Oct 2018
Publisher: Packt
ISBN-13: 9781788999830
Length: 220 pages
Edition: 1st Edition
Author: Hrishikesh Vijay Karambelkar

Choosing the right Hadoop distribution

In the previous section, we saw Hadoop evolve from a simple lab experiment into one of the best-known projects of the Apache Software Foundation. As it evolved, many commercial implementations of Hadoop emerged. Today, more than 10 different implementations exist in the market (Source). There is an ongoing debate about whether to go with fully open source Hadoop or with a commercial Hadoop implementation. Each approach has its pros and cons. Let's look at the open source approach first.

Pros of open source-based Hadoop include the following:

  • With a complete open source approach, you can take full advantage of community releases.
  • It's easier and faster to reach customers because the software is free. This also reduces the initial cost of investment.
  • Open source Hadoop supports open standards, making it easy to integrate with any system.

Cons of open source-based Hadoop include the following:

  • In a fully open source Hadoop scenario, it takes longer to build implementations than with commercial software, due to the lack of handy tools that speed up implementation.
  • Supporting customers and fixing issues can become a tedious job due to the chaotic nature of the open source community.
  • The roadmap of the product cannot be controlled or influenced based on business needs.

Given these challenges, companies often prefer commercial implementations of Apache Hadoop. We will cover some of the key Hadoop distributions in this section.

Cloudera Hadoop distribution

Cloudera is well known as one of the oldest big data implementation players in the market, and it made the first commercial releases of Hadoop. Along with a Hadoop core distribution called CDH, Cloudera today provides many tools, such as the proprietary Cloudera Manager to administer, monitor, and manage the Cloudera platform; Cloudera Director to deploy Cloudera clusters across the cloud; Cloudera Data Science Workbench to analyze large datasets and build statistical models from them; and Cloudera Navigator to provide governance on the Cloudera platform. Besides ready-to-use products, it also provides services such as training and support. Cloudera follows separate versioning for its CDH; the latest CDH (5.14) uses Apache Hadoop 2.6.

Pros of Cloudera include the following:

  • Cloudera comes with many tools that can help speed up the overall cluster creation process.
  • The Cloudera-based Hadoop distribution is one of the most mature implementations of Hadoop so far.
  • The Cloudera user interface and features such as dashboard management and wizard-based deployment offer excellent support while implementing and monitoring Hadoop clusters.
  • Cloudera looks beyond Hadoop distributions alone; it has brought in a new era of enterprise data hubs, along with many other tools that can handle much more complex business scenarios.

Cons of Cloudera include the following:

  • Cloudera distribution is not completely open source; there are proprietary components that require users to use commercial licenses. Cloudera offers a limited 60-day trial license.

Hortonworks Hadoop distribution

Hortonworks, although late to the game (founded in 2011), has quickly emerged as a leading vendor in the big data market. Hortonworks was started by Yahoo engineers. The biggest differentiator between Hortonworks and other Hadoop distributions is that Hortonworks is the only commercial vendor to offer its enterprise Hadoop distribution completely free and 100% open source. Unlike Cloudera, Hortonworks focuses on embedding Hadoop in existing data platforms. Hortonworks has two major product lines. Hortonworks Data Platform (HDP) provides an enterprise-grade open source Apache Hadoop distribution, while Hortonworks Data Flow (HDF) provides an end-to-end platform that collects, curates, analyzes, and acts on data in real time, on-premises or in the cloud, with a drag-and-drop visual interface. In addition to products, Hortonworks also provides services such as training, consultancy, and support through its partner network. Now, let's look at its pros and cons.

Pros of the Hortonworks Hadoop distribution include the following:

  • A 100% open source-based enterprise Hadoop implementation with no need for a commercial license.
  • Hortonworks provides additional open source-based tools to monitor and administer clusters.

Cons of the Hortonworks Hadoop distribution include the following:

  • As a business strategy, Hortonworks has focused on developing the platform layer, so customers planning to use Hortonworks clusters face a higher cost to build capabilities on top of it.

MapR Hadoop distribution

MapR is one of the first companies that started working on their own Hadoop distribution. MapR has gone one step further than other distributions and replaced Hadoop's HDFS with its own proprietary filesystem called MapRFS. MapRFS supports enterprise-grade features such as better data management, fault tolerance, and ease of use. One key differentiator between HDFS and MapRFS is that MapRFS allows random writes on its filesystem. Additionally, unlike HDFS, it can be mounted locally through NFS. MapR implements POSIX (HDFS has only a POSIX-like implementation), so any Linux developer can apply their existing knowledge and run familiar commands against it. Thanks to these features, filesystems such as MapRFS can be used for OLTP-like business requirements.
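To make the random-write distinction concrete, here is a minimal sketch of an in-place overwrite through ordinary POSIX file I/O, which is the kind of access a POSIX filesystem mounted over NFS permits. The helper name `overwrite_at` and the temporary local file (standing in for a file on a hypothetical NFS mount point) are illustrative assumptions, not part of any MapR API. HDFS, by contrast, follows a write-once model with append, so it does not support overwriting bytes in the middle of an existing file.

```python
import os
import tempfile


def overwrite_at(path: str, offset: int, data: bytes) -> None:
    """Overwrite bytes at a given offset of an existing file (a random write)."""
    # "r+b" opens the file for reading and writing without truncating it.
    with open(path, "r+b") as f:
        f.seek(offset)   # jump to an arbitrary position in the file
        f.write(data)    # replace the bytes at that position in place


def demo() -> bytes:
    # Assumption: a temporary local file stands in for a file that would
    # live under an NFS mount point of a POSIX-compliant filesystem.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"0123456789")
        overwrite_at(path, 3, b"XYZ")
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The same `seek` and `write` calls work on any path the operating system exposes as a regular POSIX file, which is why existing Linux tools and skills carry over to such a mount.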

Pros of the MapR Hadoop distribution include the following:

  • It's the only Hadoop distribution without Java dependencies (as MapR is based on C).
  • It offers excellent, production-ready Hadoop clusters.
  • MapRFS is easy to use and provides multi-node filesystem access through a locally mounted NFS share.

Cons of the MapR Hadoop distribution include the following:

  • It is becoming increasingly proprietary rather than open source. Many companies want development that is free of vendor lock-in, so MapR is not a good fit for them.

Each of the distributions we covered, including open source, has its own business strategy and features. Choosing the right Hadoop distribution for a problem is driven by multiple factors, such as the following:

  • What kind of application needs to be addressed by Hadoop
  • The type of application—transactional or analytical—and what are the key data processing requirements
  • Investments and the timeline of project implementation
  • Support and training requirements of a given project
