Hadoop Essentials: Delve into the key concepts of Hadoop and get a thorough understanding of the Hadoop ecosystem

Chapter 1. Introduction to Big Data and Hadoop

Hello big data enthusiast! By now you have probably heard a lot about big data: it is the current IT buzzword and generates a lot of excitement. Let us try to understand why big data is needed. A humongous amount of data is available on the Internet, at institutions, and within organizations, and it holds meaningful insights that can be extracted using data science techniques involving complex algorithms. These techniques require a lot of processing time, intermediate data, and CPU power, and a single run may take tens of hours on gigabytes of data. Data science also works on a trial-and-error basis, repeatedly checking whether one algorithm extracts better insights from the data than another. Big data systems can run such analytics on large data both faster and more efficiently, which widens the scope of R&D analysis and yields more meaningful insights more quickly than other analytic or BI systems.

Big data systems have emerged because of issues and limitations in traditional systems. Traditional systems are good for Online Transaction Processing (OLTP) and Business Intelligence (BI), but they are not easily scalable in terms of cost, effort, and manageability. Heavy computations are difficult to process and prone to memory issues, or run very slowly, which greatly hinders data analysis. Traditional systems fall short in data science analysis, and this is what makes big data systems powerful and interesting. Some examples of big data use cases are predictive analytics, fraud analytics, machine learning, pattern identification, data analytics, and the processing and analysis of semi-structured and unstructured data.

V's of big data

Typically, a problem that falls into the bracket of big data is defined by terms often called the V's of big data. There are typically three V's: volume, velocity, and variety, as shown in the following image:

[Figure: V's of big data]

Volume

According to the fifth annual survey by International Data Corporation (IDC), 1.8 zettabytes (1.8 trillion gigabytes) of information were created and replicated in 2011 alone, up from about 800 exabytes (0.8 zettabytes) in 2009, and the number is expected to more than double every two years, surpassing 35 zettabytes by 2020. Big data systems are designed to store these amounts of data, and more, with a fault-tolerant architecture. Because the data is distributed and replicated across multiple nodes, the underlying nodes can be average computing systems rather than high-performing ones, which reduces the cost drastically.
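The growth figures quoted from the survey can be checked with a little arithmetic. The following is a minimal Python sketch, assuming decimal units (1 zettabyte = 10^12 gigabytes) and a strict doubling every two years; the function name `project_volume` is made up for illustration and is not part of the IDC survey.

```python
ZB_IN_GB = 10 ** 12  # 1 zettabyte = 10^21 bytes = 10^12 gigabytes (decimal units)


def project_volume(start_zb, start_year, target_year, doubling_period_years=2.0):
    """Project a data volume, assuming it doubles every `doubling_period_years`."""
    periods = (target_year - start_year) / doubling_period_years
    return start_zb * 2 ** periods


volume_2011_zb = 1.8  # figure quoted from the IDC survey
print(f"2011: {volume_2011_zb * ZB_IN_GB:.1e} GB")

# Assumption: exact doubling every two years, starting from 2011.
for year in (2013, 2015, 2017, 2019, 2020):
    print(year, round(project_volume(volume_2011_zb, 2011, year), 1), "ZB")
```

Under a strict two-year doubling, the 2011 figure grows to roughly 40 zettabytes by 2020, which is consistent with the forecast of surpassing 35 zettabytes.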

The cost per terabyte of storage in big data systems is much lower than in other systems, which has greatly increased the interest of organizations. Even if the data grows several times over, the system scales easily, and nodes can be added without much maintenance effort.

Velocity

Processing and analyzing the amount of data discussed earlier is one of the key areas where big data is gaining popularity and has grown enormously. Not all data to be processed is large in volume initially, but as we execute complex algorithms, the data can grow massively. Most algorithms require intermediate or temporary data, which can run to gigabytes or terabytes in big data, so a significant amount of data has to be handled during processing, and the processing also has to be fast. Big data systems can run complex algorithms on huge data much more quickly because they leverage parallel processing across a distributed environment: multiple processes execute at the same time, so the job completes much faster.

For example, Yahoo set a world record in 2009 using Apache Hadoop, sorting a petabyte in 16.25 hours and a terabyte in 62 seconds. MapR has achieved terabyte data sorting in 55 seconds, which speaks volumes for the processing power, especially in analytics, where we need a lot of intermediate data to run time- and memory-intensive algorithms much faster.
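To put these sort records in perspective, the sustained throughput can be worked out directly. This is a back-of-the-envelope Python sketch, assuming decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes); the helper name `throughput_gb_per_s` is illustrative only.

```python
TB = 1e12  # bytes, decimal units (assumption)
PB = 1e15  # bytes, decimal units (assumption)


def throughput_gb_per_s(num_bytes, seconds):
    """Average sort throughput in gigabytes per second."""
    return num_bytes / seconds / 1e9


records = {
    "Yahoo, 1 PB in 16.25 hours": (PB, 16.25 * 3600),
    "Yahoo, 1 TB in 62 seconds": (TB, 62),
    "MapR, 1 TB in 55 seconds": (TB, 55),
}

for label, (size, seconds) in records.items():
    print(f"{label}: {throughput_gb_per_s(size, seconds):.1f} GB/s")
```

All three runs work out to somewhere between 16 and 19 GB per second across the whole cluster, a rate that a single machine of that era could not come close to sustaining.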

Variety

Another big challenge for traditional systems is handling the variety of semi-structured and unstructured data, such as e-mails, audio and video analysis, image analysis, social media, gene, geospatial, and 3D data. Big data systems can not only store such data, but also process it with algorithms much more quickly and efficiently. Processing semi-structured and unstructured data is complex; unlike other systems, big data systems can use the data with minimal or no preprocessing, which saves a lot of effort and helps minimize loss of data.

Understanding big data

Big data is a term that refers to the challenges we face due to the exponential growth of data in terms of the V problems. The challenges can be subdivided into the following phases:

  • Capture
  • Storage
  • Search
  • Sharing
  • Analytics
  • Visualization

Big data systems are technologies that can process and analyze data with the volume, velocity, and variety problems discussed above. Technologies that solve big data problems typically use the following architectural strategies:

  • Distributed computing system
  • Massively parallel processing (MPP)
  • NoSQL (Not only SQL)
  • Analytical database

The structure is as follows:

[Figure: Understanding big data]

Big data systems use distributed computing and parallel processing to handle big data problems. Apart from distributed computing and MPP, other architectures that can solve big data problems are oriented toward database environments: NoSQL and Advanced SQL.

NoSQL

NoSQL is a widely adopted technology because of its schema-less design and because scaling it vertically and horizontally is fairly simple and takes much less effort. SQL and RDBMS have ruled for more than three decades and perform well within the limits of the processing environment. Beyond those limits, RDBMS performance degrades, cost increases, and manageability decreases, and in these scenarios NoSQL provides an edge over RDBMS.

Note

One important thing to mention is that NoSQL databases generally do not support all ACID properties; in return, they are highly scalable, provide availability, and are fault tolerant. A NoSQL database usually provides either consistency or availability (availability of nodes for processing), depending on its architecture and design.

Types of NoSQL databases

As NoSQL databases are nonrelational, they have different sets of possible architectures and designs. Broadly, there are four general types of NoSQL databases, based on how the data is stored:

  1. Key-value store: These databases are designed for storing data as key-value pairs. The key can be custom, synthetic, or autogenerated, and the value can be a complex object such as XML, JSON, or a BLOB. The key is indexed for faster access to the data and faster retrieval of the value. Some popular key-value type databases are DynamoDB, Azure Table Storage (ATS), Riak, and BerkeleyDB.
  2. Column store: These databases are designed for storing data as groups of column families. Read/write operations are done using columns rather than rows. One advantage is the scope for compression, which saves space efficiently and avoids memory scans of columns that are not needed. Due to the column design, not all files have to be scanned, and each column file can be compressed, especially if a column has many nulls and repeating values. Column store databases are highly scalable and have a very high-performance architecture. Some popular column store type databases are HBase, BigTable, Cassandra, Vertica, and Hypertable.
  3. Document database: These databases are designed for storing, retrieving, and managing document-oriented information. A document database expands on the idea of key-value stores: the values, or documents, are stored with some structure and are encoded in formats such as XML, YAML, or JSON, or in binary forms such as BSON, PDF, Microsoft Office documents (MS Word, Excel), and so on. The advantage of storing data in an encoded format such as XML or JSON is that we can search on keys within a document, which is quite useful for ad hoc querying and semi-structured data. Some popular document-type databases are MongoDB and CouchDB.
  4. Graph database: These databases are designed for data whose relations are well represented as trees or graphs, with interconnected elements, usually nodes and edges. Relational databases are not well suited to graph-based queries, as these require a lot of complex joins, and managing the interconnections becomes messy. Graph-theoretic algorithms are useful for prediction, user tracking, clickstream analysis, calculating the shortest path, and so on; because the algorithms themselves are complex, graph databases process them much more efficiently. Some popular graph-type databases are Neo4j and Polyglot.
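The four storage models can be compared by representing the same kind of customer data in each shape. The following Python sketch uses only in-memory dictionaries to illustrate the data layouts; it does not use the API of any of the databases named above, and all names and sample values are invented for the example.

```python
import json
from collections import deque

# 1. Key-value store: the value is an opaque blob, retrieved only by its key.
key_value = {"customer:42": json.dumps({"name": "Asha", "city": "Pune"})}

# 2. Column store: row key -> column family -> column -> value.
column_store = {
    "customer:42": {
        "profile": {"name": "Asha", "city": "Pune"},
        "orders": {"2015-01-03": "book", "2015-02-11": "laptop"},
    }
}

# 3. Document database: structured documents that can be searched by inner keys.
documents = [
    {"_id": 42, "name": "Asha", "address": {"city": "Pune"}},
    {"_id": 43, "name": "Ben", "address": {"city": "Delhi"}},
]


def find_by_city(docs, city):
    """Ad hoc query on a key nested inside each document."""
    return [d for d in docs if d.get("address", {}).get("city") == city]


# 4. Graph database: nodes and edges, stored here as an adjacency list.
graph = {"Asha": ["Ben", "Chen"], "Ben": ["Dara"], "Chen": ["Dara"],
         "Dara": ["Eli"], "Eli": []}


def shortest_path(edges, start, goal):
    """Breadth-first search: returns one shortest path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(json.loads(key_value["customer:42"])["city"])
print(column_store["customer:42"]["orders"])
print(find_by_city(documents, "Pune"))
print(shortest_path(graph, "Asha", "Eli"))
```

Note how only the document and graph shapes support queries on the inside of the data (a nested key, a path between nodes), whereas the key-value shape can only be fetched as a whole by its key.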

Analytical database

An analytical database is a type of database built to store, manage, and consume big data. Analytical databases are vendor-managed DBMS, which are optimized for processing advanced analytics that involves highly complex queries on terabytes of data and complex statistical processing, data mining, and NLP (natural language processing). Examples of analytical databases are Vertica (acquired by HP), Aster Data (acquired by Teradata), Greenplum (acquired by EMC), and so on.

Who is creating big data?

Data is growing exponentially and comes from multiple sources that emit data continuously and consistently. In some domains, we have to analyze data produced by machines, sensors, quality systems, equipment, data points, and so on. Some of the sources that are creating big data are as follows:

  • Monitoring sensors: Climate or ocean wave monitoring sensors generate data consistently and in large volumes, and there may be millions of sensors capturing data.
  • Posts to social media sites: Social media websites such as Facebook, Twitter, and others hold a huge amount of data, running to petabytes.
  • Digital pictures and videos posted online: Websites such as YouTube, Netflix, and others process a huge amount of digital video and data that can run to petabytes.
  • Transaction records of online purchases: E-commerce sites such as eBay, Amazon, Flipkart, and others process thousands of transactions at any one time.
  • Server/application logs: Applications generate log data that grows consistently, and analysis of this data becomes difficult.
  • CDR (call data records): Roaming data and cell phone GPS signals, to name a few.
  • Science: Genomics, biogeochemical, biological, and other complex and/or interdisciplinary scientific research.

Big data use cases

Let's look at a credit card issuer use case (demonstrated by MapR).

A credit card issuer client wants to improve its existing recommendation system, which is lagging, and could see potentially huge profits if recommendations were faster.

The existing system is an Enterprise Data Warehouse (EDW), which is very costly and slow in generating recommendations, which, in turn, impacts potential profits. As Hadoop is cheaper and faster, it can generate much higher profits than the existing system.

Usually, a credit card customer will have data like the following:

  • Customer purchase history (big)
  • Merchant designations
  • Merchant special offers

Let's analyze a general comparison of existing EDW platforms with a big data solution. The recommendation system is designed using Mahout (scalable Machine Learning library API) and Solr/Lucene. Recommendation is based on the co-occurrence matrix implemented as the search index.
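The idea of recommending from a co-occurrence matrix can be illustrated on a toy purchase history. The following is a plain Python sketch of the general technique, not the Mahout or Solr/Lucene implementation used in the MapR use case; the customer IDs, merchant names, and function names are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(histories):
    """Count how often two merchants appear together in one customer's history."""
    counts = defaultdict(lambda: defaultdict(int))
    for merchants in histories.values():
        for a, b in combinations(sorted(set(merchants)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, history, top_n=3):
    """Rank unseen merchants by how often they co-occur with the customer's own."""
    seen = set(history)
    scores = defaultdict(int)
    for merchant in seen:
        for other, n in counts[merchant].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties are broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [merchant for merchant, _ in ranked[:top_n]]


# Hypothetical purchase histories (customer -> merchants visited).
histories = {
    "c1": ["grocer", "cinema", "bookshop"],
    "c2": ["grocer", "cinema"],
    "c3": ["grocer", "bookshop", "cafe"],
    "c4": ["cinema", "cafe"],
}

counts = build_cooccurrence(histories)
print(recommend(counts, ["grocer"]))
```

In a production system the co-occurrence counts would be computed in a distributed batch job and served from an index, which is the role the text assigns to Mahout and Solr/Lucene respectively.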

The benchmarked time improved from 20 hours to just 3 hours, which is more than six times faster, as shown in the following image:

[Figure: Big data use cases]

In the web tier, shown in the following image, the improvement is from 8 hours to 3 minutes:

[Figure: Big data use cases]

So, eventually, we can say that time decreases, revenue increases, and Hadoop offers a cost-effective solution, hence profit increases, as shown in the following image:

[Figure: Big data use cases]

Big data use case patterns

There are many technological scenarios, and some of them follow similar patterns. It is a good idea to map scenarios to architectural patterns. Once these patterns are understood, they become the fundamental building blocks of solutions. We will discuss five types of patterns in the following sections.

Note

These solutions are not always optimal, and the right choice may depend on the domain, the type of data, or other factors. The examples are meant to help visualize a problem and find a solution.

Big data as a storage pattern

Big data systems can be used as a storage pattern or as a data warehouse, where data from multiple sources, even with different types of data, can be stored and can be utilized later. The usage scenario and use case are as follows:

  • Usage scenario:
    • Data getting continuously generated in large volumes
    • Need for preprocessing before getting loaded into the target system
  • Use case:
    • Machine data captured for subsequent cleansing can be merged into one or more big files and loaded into Hadoop for computation
    • Unstructured data across multiple sources should be captured for subsequent analysis of emerging patterns
    • Data loaded into Hadoop should be processed and filtered; depending on the data, the storage can be a data warehouse, Hadoop, or any NoSQL system

The storage pattern is shown in the following figure:

[Figure: Big data as a storage pattern]

Big data as a data transformation pattern

Big data systems can be designed to perform transformation as part of the data loading and cleansing activity, and many transformations can be done faster than in traditional systems due to parallelism. Transformation is one phase of the Extract–Transform–Load (ETL) process of data ingestion and cleansing. The usage scenario and use case are as follows:

  • Usage scenario
    • A large volume of raw data to be preprocessed
    • Data type includes structured as well as non-structured data
  • Use case
    • Evolution of ETL (Extract–Transform–Load) tools to leverage big data, for example, Pentaho, Talend, and so on. In Hadoop, ELT (Extract–Load–Transform) is also trending: loading is faster in Hadoop, and cleansing can then run as a parallel process to clean and transform the input
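The load-first, transform-in-parallel idea behind ELT can be sketched in a few lines. The following Python example is only an illustration under stated assumptions: the record layout (`id,city,amount`), the function names, and the use of a thread pool are all invented here, and threads on one machine merely stand in for the separate nodes on which Hadoop would clean each partition.

```python
from concurrent.futures import ThreadPoolExecutor


def clean_record(line):
    """Turn one raw 'id,city,amount' line into a normalized dict, or None if invalid."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 3:
        return None
    record_id, city, amount = parts
    if not amount.replace(".", "", 1).isdigit():
        return None
    return {"id": record_id, "city": city.title(), "amount": float(amount)}


def clean_partition(lines):
    """Clean one partition of already-loaded raw lines, dropping invalid ones."""
    return [rec for rec in map(clean_record, lines) if rec is not None]


def transform(raw_lines, partitions=4):
    """Split the loaded data into partitions and clean the partitions concurrently."""
    size = max(1, -(-len(raw_lines) // partitions))  # ceiling division
    chunks = [raw_lines[i:i + size] for i in range(0, len(raw_lines), size)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        cleaned_chunks = list(pool.map(clean_partition, chunks))
    return [rec for chunk in cleaned_chunks for rec in chunk]


# The raw lines are "loaded" as-is first; cleansing happens afterwards.
raw = ["1, pune ,10.5", "2,delhi,abc", "not a record", "3,new york,7"]
print(transform(raw, partitions=2))
```

Because each partition is cleaned independently, adding workers (or, in Hadoop, nodes) shortens the transformation step without changing the cleansing logic.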

The data transformation pattern is shown in the following figure:

[Figure: Big data as a data transformation pattern]

Big data for a data analysis pattern

Data analytics is of wider interest in big data systems, where a huge amount of data can be analyzed to generate statistical reports and insights about the data, which can be useful in business and understanding of patterns. The usage scenario and use case are as follows:

  • Usage scenario
    • Improved response time for detection of patterns
    • Data analysis for non-structured data
  • Use case
    • Fast turnaround for machine data analysis (for example, analysis of seismic data)
    • Pattern detection across structured and non-structured data (for example, fraud analysis)

Big data for data in a real-time pattern

Big data systems that integrate with streaming libraries and systems are capable of handling high-scale real-time data processing. Real-time processing for large and complex requirements poses a lot of challenges, such as performance, scalability, availability, resource management, low latency, and so on. Some streaming technologies, such as Storm and Spark Streaming, can be integrated with YARN. The usage scenario and use case are as follows:

  • Usage scenario
    • Managing the action to be taken based on continuously changing data in real time
  • Use case
    • Automated process control based on real-time events from manufacturing equipment
    • Real-time changes to plant operations based on events from business systems such as Enterprise Resource Planning (ERP) systems

The data in a real-time pattern is shown in the following figure:

[Figure: Big data for data in a real-time pattern]
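Taking action on continuously changing data usually comes down to evaluating a rule over a moving window of recent events. The following Python sketch shows that idea with a generator and a fixed-size window; it is not Storm or Spark Streaming code, and the sensor readings, window size, threshold, and function name are assumptions made for the example.

```python
from collections import deque


def rolling_alerts(readings, window=3, threshold=80.0):
    """Yield (index, average) whenever the mean of the last `window` readings
    exceeds `threshold`. Readings are consumed one at a time, as from a stream."""
    recent = deque(maxlen=window)  # old readings fall out automatically
    for index, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            average = sum(recent) / window
            if average > threshold:
                yield index, average


# Hypothetical temperature readings from one piece of manufacturing equipment.
stream = [70, 75, 78, 85, 90, 95, 60]
for index, average in rolling_alerts(stream):
    print(f"reading {index}: rolling average {average:.1f} exceeds threshold")
```

A streaming engine applies the same kind of windowed rule, but spreads the incoming events across many workers and keeps the window state fault tolerant.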

Big data for a low latency caching pattern

Big data systems can be tuned, as a special case, for low latency workloads in which reads far outnumber updates. Data can then be fetched faster and held in memory, which further improves performance and avoids overheads. The usage scenario and use case are as follows:

  • Usage scenario
    • Reads are far higher in ratio to writes
    • Reads require very low latency and a guaranteed response
    • Distributed location-based data caching
  • Use case
    • Order promising solutions
    • Cloud-based identity and SSO
    • Low latency real-time personalized offers on mobile

The low latency caching pattern is shown in the following figure:

[Figure: Big data for a low latency caching pattern]
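The read-heavy caching idea can be sketched with a small read-through cache. This Python example is a single-process illustration only, assuming a dictionary as the slow backing store; the class name `ReadThroughCache` and the sample key are invented, and a real deployment would use a distributed in-memory cache rather than `functools.lru_cache`.

```python
from functools import lru_cache


class ReadThroughCache:
    """Serve repeated reads from memory and fall back to a slower store on a miss."""

    def __init__(self, store, maxsize=1024):
        self._store = store        # stands in for the slow backing system
        self.backend_reads = 0     # counts how often the backing store was hit
        self._cached_get = lru_cache(maxsize=maxsize)(self._load)

    def _load(self, key):
        self.backend_reads += 1
        return self._store.get(key)

    def get(self, key):
        return self._cached_get(key)

    def put(self, key, value):
        # Writes are rare in this pattern, so simply drop all cached entries.
        self._store[key] = value
        self._cached_get.cache_clear()


offers = ReadThroughCache({"customer:42": "10% off books"})
offers.get("customer:42")  # miss: reads the backing store
offers.get("customer:42")  # hit: served from memory
print(offers.backend_reads)
```

Clearing the whole cache on every write is only reasonable because updates are assumed to be rare; a write-heavy workload would need finer-grained invalidation.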

Some of the technology stacks that are widely used, according to layer and framework, are shown in the following image:

[Figure: Widely used technology stacks by layer and framework]

Hadoop

In big data, the most widely used system is Hadoop. Hadoop is an open source implementation of big data concepts that is widely accepted in the industry, and its benchmarks are impressive and, in some cases, unmatched by other systems. Hadoop is used in the industry for large-scale, massively parallel, and distributed data processing. Hadoop is highly fault tolerant, and the level of fault tolerance is configurable: the higher the level we need, the more copies of the data are stored across the cluster.
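The trade-off between fault tolerance and storage can be made concrete with simple arithmetic. In HDFS the number of copies kept of each block is called the replication factor. The following Python sketch assumes a hypothetical cluster of 10 nodes with 4 TB of disk each; the cluster figures and function names are invented for illustration.

```python
def usable_capacity_tb(nodes, disk_per_node_tb, replication):
    """Raw cluster capacity divided by the number of copies kept of each block."""
    return nodes * disk_per_node_tb / replication


def survivable_copy_losses(replication):
    """Copies of a block that can be lost while one copy still remains."""
    return replication - 1


# Hypothetical cluster: 10 nodes with 4 TB of disk each.
for replication in (1, 2, 3):
    capacity = usable_capacity_tb(10, 4, replication)
    losses = survivable_copy_losses(replication)
    print(f"replication={replication}: {capacity:.1f} TB usable, "
          f"survives loss of {losses} copies per block")
```

Raising the replication factor buys tolerance to more simultaneous failures at the price of proportionally less usable storage.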

As we have already touched upon, the architecture of big data systems revolves around two major components: distributed computing and parallel processing. In Hadoop, the distributed side is handled by HDFS, a distributed filesystem, and parallel processing is handled by MapReduce. In short, we can say that Hadoop is a combination of HDFS and MapReduce, as shown in the following image:

[Figure: Hadoop]

We will cover these two topics in detail in the next chapters.
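As a first taste of the MapReduce half of this combination, the classic word count can be simulated in plain Python. This is a single-process sketch of the map, shuffle, and reduce steps, not Hadoop's Java API; the two input strings stand in for blocks of a file stored on different HDFS nodes, and the function names are chosen for the example.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of input."""
    for word in block.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(blocks):
    # In Hadoop, each block would be mapped on the node that stores it.
    mapped = [pair for block in blocks for pair in map_phase(block)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


blocks = ["big data big insights", "hadoop stores big data"]
print(word_count(blocks))
```

Because each block is mapped independently and each key is reduced independently, both phases can be spread over as many nodes as the cluster provides.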

Hadoop history

Hadoop began with a project called Nutch, an open source crawler-based search engine that runs on a distributed system. In 2003–2004, Google released its GFS and MapReduce papers, and MapReduce was adapted for Nutch. Doug Cutting and Mike Cafarella are the creators of Hadoop. When Doug Cutting joined Yahoo, a new project was created along similar lines to Nutch, which we call Hadoop, and Nutch remained a separate sub-project. Different releases followed, and other separate sub-projects started integrating with Hadoop; together these form what we call the Hadoop ecosystem.

The following figure and description depict the history of Hadoop, with timelines and milestones:

[Figure: Hadoop history]

Description

  • 2002.8: The Nutch Project was started
  • 2003.2: The first MapReduce library was written at Google
  • 2003.10: The Google File System paper was published
  • 2004.12: The Google MapReduce paper was published
  • 2005.7: Doug Cutting reported that Nutch now used the new MapReduce implementation
  • 2006.2: Hadoop code moved out of Nutch into a new Lucene sub-project
  • 2006.11: The Google Bigtable paper was published
  • 2007.2: Mike Cafarella dropped the first HBase code
  • 2007.4: Yahoo! was running Hadoop on a 1,000-node cluster
  • 2008.1: Hadoop was made an Apache Top Level Project
  • 2008.7: Hadoop broke the Terabyte data sort Benchmark
  • 2008.11: Hadoop 0.19 was released
  • 2011.12: Hadoop 1.0 was released
  • 2012.10: Hadoop 2.0 was alpha released
  • 2013.10: Hadoop 2.2.0 was released
  • 2014.10: Hadoop 2.6.0 was released

Advantages of Hadoop

Hadoop has a lot of advantages, and some of them are as follows:

  • Low cost (runs on commodity hardware): Hadoop can run on average-performing commodity hardware and doesn't require a high-performance system, which helps control cost while achieving scalability and performance. Nodes can be added to or removed from the cluster simply, as and when required. The cost per terabyte is lower for storage and processing in Hadoop.
  • Storage flexibility: Hadoop can store data in raw format in a distributed environment. Hadoop can process the unstructured data and semi-structured data better than most of the available technologies. Hadoop gives full flexibility to process the data and we will not have any loss of data.
  • Open source community: Hadoop is open source and supported by many contributors with a growing network of developers worldwide. Many organizations such as Yahoo, Facebook, Hortonworks, and others have contributed immensely toward the progress of Hadoop and other related sub-projects.
  • Fault tolerant: Hadoop is massively scalable and fault tolerant. Hadoop is reliable in terms of data availability, and even if some nodes go down, Hadoop can recover the data. Hadoop architecture assumes that nodes can go down and the system should be able to process the data.
  • Complex data analytics: With the emergence of big data, data science has also grown by leaps and bounds, and we now have complex, computation-intensive algorithms for data analysis. Hadoop can run such scalable algorithms on very large-scale data and can process them faster.

Uses of Hadoop

Some examples of use cases where Hadoop is used are as follows:

  • Searching/text mining
  • Log processing
  • Recommendation systems
  • Business intelligence/data warehousing
  • Video and image analysis
  • Archiving
  • Graph creation and analysis
  • Pattern recognition
  • Risk assessment
  • Sentiment analysis

Hadoop ecosystem

A Hadoop cluster can have thousands of nodes, which makes it complex and difficult to manage manually; hence there are components that assist with configuration, maintenance, and management of the whole Hadoop system. In this book, we will touch upon the following components in Chapter 2, Hadoop Ecosystem.

| Layer | Utility/Tool name |
|---|---|
| Distributed filesystem | Apache HDFS |
| Distributed programming | Apache MapReduce, Apache Hive, Apache Pig, Apache Spark |
| NoSQL databases | Apache HBase |
| Data ingestion | Apache Flume, Apache Sqoop, Apache Storm |
| Service programming | Apache Zookeeper |
| Scheduling | Apache Oozie |
| Machine learning | Apache Mahout |
| System deployment | Apache Ambari |

All the components above are helpful in managing Hadoop tasks and jobs.

Apache Hadoop

The open source Hadoop is maintained by the Apache Software Foundation. The official website for Apache Hadoop is http://hadoop.apache.org/, where the packages and other details are described elaborately. The current Apache Hadoop project (version 2.6) includes the following modules:

  • Hadoop common: The common utilities that support other Hadoop modules
  • Hadoop Distributed File System (HDFS): A distributed filesystem that provides high-throughput access to application data
  • Hadoop YARN: A framework for job scheduling and cluster resource management
  • Hadoop MapReduce: A YARN-based system for parallel processing of large datasets

Apache Hadoop can be deployed in the following three modes:

  • Standalone: It is used for simple analysis or debugging.
  • Pseudo distributed: It helps you to simulate a multi-node installation on a single node. In pseudo-distributed mode, each of the component processes runs in a separate JVM. Instead of installing Hadoop on different servers, you can simulate it on a single server.
  • Distributed: It is a cluster with tens, hundreds, or thousands of worker nodes.

In a Hadoop ecosystem, along with Hadoop, there are many utility components that are separate Apache projects, such as Hive, Pig, HBase, Sqoop, Flume, ZooKeeper, Mahout, and so on, which have to be configured separately. We have to be careful about the compatibility of sub-projects with Hadoop versions, as not all versions are inter-compatible.

Apache Hadoop is an open source project, which has many benefits: the source code can be updated, and contributors regularly add improvements. One downside of being an open source project is that companies usually offer support for their own products, not for an open source project. Customers prefer support and therefore adopt Hadoop distributions supported by vendors.

Let's look at some Hadoop distributions available.

Hadoop distributions

Hadoop distributions are supported by the companies managing them, and some distributions also have license costs. Companies such as Cloudera, Hortonworks, Amazon, MapR, and Pivotal each have a Hadoop distribution in the market that offers Hadoop with the required sub-packages and projects, which are compatible with each other and come with commercial support. This greatly reduces effort, not just for operations, but also for deployment and monitoring, and the bundled tools and utilities make development of the product or project easier and faster.

For managing the Hadoop cluster, Hadoop distributions provide graphical web UI tooling for the deployment, administration, and monitoring of clusters. This tooling can be used to set up, manage, and monitor complex clusters, which saves a lot of effort and time.

Some Hadoop distributions which are available are as follows:

  • Cloudera: According to The Forrester Wave™: Big Data Hadoop Solutions, Q1 2014, this is the most widely used Hadoop distribution with the biggest customer base. It provides good support and has some good utility components, such as Cloudera Manager, which can create, manage, and maintain a cluster and manage job processing. Impala, which has real-time processing capability, was developed and contributed by Cloudera.
  • Hortonworks: Hortonworks' strategy is to drive all innovation through the open source community and create an ecosystem of partners that accelerates Hadoop adoption among enterprises. It uses an open source Hadoop project and is a major contributor to Hadoop enhancement in Apache Hadoop. Ambari was developed and contributed to Apache by Hortonworks. Hortonworks offers a very good, easy-to-use sandbox for getting started. Hortonworks contributed changes that made Apache Hadoop run natively on the Microsoft Windows platforms including Windows Server and Microsoft Azure.
  • MapR: The MapR distribution of Hadoop uses different concepts from plain open source Hadoop and its competitors, notably support for a network file system (NFS) instead of HDFS for better performance and ease of use. With NFS, native Unix commands can be used instead of Hadoop commands. MapR has high availability features such as snapshots, mirroring, and stateful failover.
  • Amazon Elastic MapReduce (EMR): AWS's Elastic MapReduce (EMR) leverages its comprehensive cloud services, such as Amazon EC2 for compute, Amazon S3 for storage, and other services, to offer a very strong Hadoop solution for customers who wish to implement Hadoop in the cloud. EMR is advisable for infrequent big data processing, where it might save you a lot of money.

Pillars of Hadoop

Hadoop is designed to be highly scalable, distributed, massively parallel, fault tolerant, and flexible, and the key aspects of the design are HDFS, MapReduce, and YARN. HDFS and MapReduce can perform very large-scale batch processing at a much faster rate. Thanks to contributions from various organizations and institutions, the Hadoop architecture has undergone many improvements, and one of them is YARN. YARN overcomes some limitations of Hadoop and allows it to integrate easily with different applications and environments, especially for streaming and real-time analysis. Two examples that we are going to discuss are Storm and Spark: both are well known in streaming and real-time analysis, and both can integrate with Hadoop via YARN.

We will cover the concept of HDFS, MapReduce, and YARN in greater detail in Chapter 3, Pillars of Hadoop – HDFS, MapReduce, and YARN.

Data access components

MapReduce is a very powerful framework, but mastering and optimizing a MapReduce job involves a steep learning curve. When analyzing data in the MapReduce paradigm, a lot of our time is spent coding. In big data, users come from different backgrounds, such as programming, scripting, EDW, DBA, analytics, and so on, and for such users there are abstraction layers on top of MapReduce. Hive and Pig are two such layers: Hive has a SQL-like query interface, and Pig has a procedural language interface called Pig Latin. Analyzing data through these layers becomes much easier.

We will cover the concept of Hive and Pig in greater detail in Chapter 4, Data Access Component – Hive and Pig.

Data storage component

HBase is a column store-based NoSQL database solution. HBase's data model is very similar to that of Google's BigTable. HBase can efficiently handle random, real-time access to a large volume of data, usually millions or billions of rows. An important advantage of HBase is that it supports updates on large tables and fast lookups. The HBase data store supports linear and modular scaling. HBase stores data as a distributed, multidimensional map. HBase tables can also serve as the input and output of MapReduce jobs that run in parallel.
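The multidimensional map can be pictured as nested dictionaries keyed by row, then column, then timestamp. The following Python sketch models only the shape of the data, not the HBase client API; the function names `put_cell` and `get_cell`, the table contents, and the integer timestamps are assumptions made for illustration.

```python
def put_cell(table, row, column, value, timestamp):
    """Store one versioned cell: table[row]["family:qualifier"][timestamp] = value."""
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value


def get_cell(table, row, column, timestamp=None):
    """Return the newest version of a cell, or the newest at or before `timestamp`."""
    versions = table.get(row, {}).get(column, {})
    candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
    return versions[max(candidates)] if candidates else None


# Hypothetical table: row key -> "family:qualifier" -> {timestamp: value}
customers = {}
put_cell(customers, "customer:42", "profile:city", "Pune", timestamp=1)
put_cell(customers, "customer:42", "profile:city", "Mumbai", timestamp=5)
put_cell(customers, "customer:42", "orders:last", "laptop", timestamp=3)

print(get_cell(customers, "customer:42", "profile:city"))               # newest version
print(get_cell(customers, "customer:42", "profile:city", timestamp=3))  # as of time 3
```

An update never overwrites a cell in place; it adds a newer version, which is one reason random updates on very large tables stay cheap in this model.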

We will cover the concept of HBase in greater detail in Chapter 5, Storage Component—HBase.

Data ingestion in Hadoop

In Hadoop, storage is never an issue, but managing the data is the driving force around which solutions are designed, differently for different systems; hence managing data becomes extremely critical. A more manageable system helps a lot in terms of scalability, reusability, and even performance. In the Hadoop ecosystem, we have two widely used tools, Sqoop and Flume. Both help manage the data and can import and export it efficiently, with good performance. Sqoop is usually used for data integration with RDBMS systems, and Flume usually performs better with streaming log data.

We will cover the concept of Sqoop and Flume in greater detail in Chapter 6, Data Ingestion in Hadoop—Sqoop and Flume.

Streaming and real-time analysis

Storm and Spark are two new and fascinating components that can run on YARN and have some amazing capabilities for streaming and real-time analysis. Both are used in scenarios where heavy, continuous streaming data has to be processed in real time or near real time. The traffic analyzer example discussed earlier is a good use case for Storm and Spark.

We will cover the concept of Storm and Spark in greater detail in Chapter 7, Streaming and Real-time Analysis—Storm and Spark.

Summary

In this chapter, we spoke about big data and its use case patterns. We explored a bit of Hadoop's history and then moved on to the advantages and uses of Hadoop.

Hadoop systems are complex to monitor and manage, so there are separate sub-projects, frameworks, tools, and utilities that integrate with Hadoop and help manage tasks better. Together they are called the Hadoop ecosystem, which we will discuss in subsequent chapters.


Key benefits

  • Get to grips with the most powerful tools in the Hadoop ecosystem, including Storm and Spark
  • Learn everything you need to take control of Big Data
  • A fast-paced journey through the key features of Hadoop

Description

This book jumps into the world of Hadoop and its tools, to help you learn how to use them effectively to optimize and improve the way you handle Big Data. Starting with the fundamentals of Hadoop (YARN, MapReduce, HDFS, and other vital elements of the Hadoop ecosystem), you will soon move on to topics such as MapReduce patterns, data management, and real-time data analysis using Hadoop. You will also explore a number of the leading data processing tools, including Hive and Pig, and learn how to use Sqoop and Flume, two of the most powerful technologies used for data ingestion. With further guidance on data streaming and real-time analytics with Storm and Spark, Hadoop Essentials is a reliable and relevant resource for anyone who understands the difficulties - and opportunities - presented by Big Data today. With this guide, you'll develop your confidence with Hadoop and be able to use the knowledge and skills you learn to harness its capabilities.

Who is this book for?

If you are a system or application developer interested in learning how to solve practical problems using the Hadoop framework, then this book is ideal for you. This book is also meant for Hadoop professionals who want to find solutions to the different challenges they come across in their projects. It assumes a familiarity with distributed storage and distributed applications.

What you will learn

  • Get to grips with the fundamentals of Hadoop, and tools such as HDFS, MapReduce, and YARN
  • Learn how to use Hadoop for real-world Big Data projects
  • Improve the performance of your Big Data architecture
  • Find out how to get the most from data processing tools such as Hive and Pig
  • Learn how to unlock real-time Big Data analytics with Apache Spark

Product Details

Publication date: Apr 29, 2015
Length: 194 pages
Edition: 1st
Language: English
ISBN-13: 9781784396688

Table of Contents

8 Chapters
1. Introduction to Big Data and Hadoop
2. Hadoop Ecosystem
3. Pillars of Hadoop – HDFS, MapReduce, and YARN
4. Data Access Components – Hive and Pig
5. Storage Component – HBase
6. Data Ingestion in Hadoop – Sqoop and Flume
7. Streaming and Real-time Analysis – Storm and Spark
Index

Customer reviews

Rating distribution
3.5 out of 5 (6 Ratings)
5 star 16.7%
4 star 50%
3 star 16.7%
2 star 0%
1 star 16.7%
Top Reviews


Amazon Customer Jan 13, 2017
Rating: 5 out of 5
This book covers a good number of Hadoop tools and APIs. A well-composed compilation of the important need-to-know knowledge in Big Data.
Amazon Verified review
David Jun 08, 2015
Rating: 4 out of 5
Excellent book to introduce Hadoop. I had known the name for years without ever finding time to look deeper at it. The book introduces the major concepts behind it and the modules available, with alternatives whenever available. It's clearly a must-read for people new to the concept of big data manipulation with a framework like Hadoop. Each piece, from the distributed filesystem to data parsing, is covered by the book.
Amazon Verified review
Francesco Corti Jun 17, 2015
Rating: 4 out of 5
I think this is a nice book for developers and software architects with no relevant experience in big data and Apache Hadoop. Big data is definitely an IT buzzword, but this book tries to bring order to today’s scenario, with a practical and interesting look at the Apache Hadoop implementation. I enjoyed reading this book and I think I will fix in my mind some of the initial descriptions of the introductory part, which I find very rational and clear. Nice job, Shiva.
Amazon Verified review
PJG May 21, 2015
Rating: 4 out of 5
This is a very detailed introductory guide to Hadoop. The key question is: how does it rate against the millions of other Hadoop books in existence?

At 194 pages, this is a slim volume when compared to "Hadoop: The Definitive Guide", but in terms of content it's surprisingly packed. Chapter 3, which covers HDFS, MapReduce, and YARN, is a good case in point: there's quite a lot of depth here, but the number of diagrams really helps clarify what is a fairly whistle-stop tour through fundamental Hadoop building blocks. Later chapters give a detailed overview of HBase, Sqoop and Flume and these are useful. Spark and Storm are covered, with a brief note on Lambda architectures.

Overall, this is quite a nice reference, although the small page count does reflect a rather terse style and I don't think it would be a particularly easy book to learn from for a newcomer to Hadoop. However, sometimes it's useful to have a rapid overview with just the distilled, essential facts and the book certainly achieves that.
Amazon Verified review
Ian Stirk Jun 10, 2015
Rating: 3 out of 5
Hi,

I have written a detailed chapter-by-chapter review of this book on www DOT i-programmer DOT info, the first and last parts of this review are given here. For my review of all chapters, search i-programmer DOT info for STIRK together with the book's title.

This book aims to give you an understanding of Hadoop and some of its major components, explaining how and when to use them, and providing scenarios where they should be used.

It is aimed at application and system developers that want to solve practical problems using the Hadoop framework. It is also intended for Hadoop professionals who want to find solutions to the different challenges they come across in their Hadoop projects.

A prerequisite is a good understanding of Java programming, additionally, a basic understanding of distributed computing would be helpful.

Below is a chapter-by-chapter exploration of the topics covered.

Chapter 1 Introduction to Big Data and Hadoop

The chapter opens with the emergence of big data systems as a response to the limitations of relational databases (RDBMS), which were unable to process big data in a timely and cost-effective manner.

The next section looks at explaining the need for big data systems with reference to the 3 Vs of big data:

  • Volume (1.8 zettabytes of data created in 2011, 35 zettabytes expected by 2020)
  • Velocity (data arriving quickly)
  • Variety (structured and semi-structured data e.g. emails)

The chapter continues with a look at the sources of big data, including: monitoring sensors, social media posts, videos/photos, logs etc. Some big data use case patterns are briefly described.

Next, Hadoop is examined, being the most popular big data platform. Hadoop is open source, and offers large-scale massively parallel distributed processing. Hadoop has 2 major components: HDFS (Hadoop Distributed File System) - Hadoop’s storage system, and MapReduce – Hadoop’s batch processing model. The section continues with a look at Hadoop’s history, advantages, uses, and related components.

The remainder of the chapter provides an overview of the other chapters of the book, namely:

  • Pillars of Hadoop (HDFS, MapReduce, YARN)
  • Data access components (Hive, Pig)
  • Data storage component (HBase)
  • Data ingestion in Hadoop (Sqoop, Flume)
  • Streaming and real-time analysis (Storm, Spark)

This chapter provides a useful understanding of how big data processing arose, and how Hadoop fulfils this need. There’s a useful overview of the four main types of NoSQL database. There’s a helpful overview of Hadoop, its history, advantages, uses, and associated components. There are plenty of helpful diagrams to aid understanding (as there are in the rest of the book), and a useful introduction to what’s coming in the rest of the book.

Sometimes, the English grammar is substandard; this occurs in various sections of the book. Some subsections seem disjointed (okay within themselves, but not part of a wider coherent section) – again this occurs in other parts of the book. There’s a small error relating to the amount of total data created in 2009, the value given is 800GB, the correct value is 800 exabytes or 0.8 zettabytes. All these problems should have been caught by the reviewers/editors.

...

Conclusion

This book aims to give you an understanding of Hadoop and some of its major components, and largely succeeds. For a short book, it covers a wide area. I think only a little understanding of Java (or a comparable language) is needed to read this book. The extensive use of diagrams is helpful.

The book should prove useful to developers wanting to know more about Hadoop and its major associated technologies. The book provides a helpful overview of Hadoop, HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Flume, Storm and Spark. While not comprehensive (e.g. Impala and Hue are not discussed), it does cover many of the popular components.

The English grammar in some sections is substandard, making the book awkward to read. An editor with a good understanding of English would improve the book’s readability. Some sentences are illogical e.g. “Hadoop is primarily designed for batch processing and for Lambda Architecture systems.” – But, Lambda Architecture includes batch and stream processing! Additionally, some sections seem muddled – probably amplified by the bad grammar and illogical thought.

Overall, if you can bypass the problems, this is a useful book, wide in scope and quite detailed for a short book.
Amazon Verified review