Packt+ | Advance your knowledge in tech

You're reading from Mastering Hadoop 3

Product type Book

Published in Feb 2019

Publisher Packt

ISBN-13 9781788620444

Pages 544 pages

Edition 1st Edition

Languages

Java

Concepts

Data Processing

Authors (2):

Chanchal Singh

Manish Kumar

View More author details

Table of Contents (23) Chapters

Title Page

Dedication

About Packt

Foreword

Contributors

Preface

1. Journey to Hadoop 3

2. Deep Dive into the Hadoop Distributed File System

3. YARN Resource Management in Hadoop

4. Internals of MapReduce

5. SQL on Hadoop

6. Real-Time Processing Engines

7. Widely Used Hadoop Ecosystem Components

8. Designing Applications in Hadoop

9. Real-Time Stream Processing in Hadoop

10. Machine Learning in Hadoop

11. Hadoop in the Cloud

12. Hadoop Cluster Profiling

13. Who Can Do What in Hadoop

14. Network and Data Security

15. Monitoring Hadoop

1. Other Books You May Enjoy

Leave a review - let other readers know what you think

Index

Hadoop logical view

The Hadoop Logical view can be divided into multiple sections. These sections can be viewed as a logical sequence, with steps starting from Ingress/Egress and ending at Data Storage Medium.

The following diagram shows the Hadoop platform logical view:

We will touch upon these sections as shown in the preceding diagram one by one, to understand them. However, when designing any Hadoop application, you should think in terms of those sections and make technological choices according to the use case problems you are trying to solve. Without wasting time, let's look at these sections one by one:

Ingress/egress/processing: Any interaction with the Hadoop platform should be viewed in terms of the following:
- Ingesting (ingress) data
- Reading (Egress) data
- Processing already ingested data

These actions can be automated via the use of tools or automated code. This can be achieved by user actions, by either uploading data to Hadoop or downloading data from Hadoop. Sometimes, users trigger actions that may result in Ingress/egress or the processing of data.

Data integration components: For ingress/egress or data processing in Hadoop, you need data integration components. These components are tools, software, or custom code that help integrate the underlying Hadoop data with user views or actions. If we talk about the user perspective alone, then these components give end users a unified view of data in Hadoop across different distributed Hadoop folders, in different files and data formats. These components provide end users and applications with an entry point for using or manipulating Hadoop data using different data access interfaces and data processing engines. We will exlpore the definition of data access interfaces and processing engines in the next section. In a nutshell, tools such as Hue and software (libraries) such as Sqoop, Java Hadoop Clients, and Hive Beeline Clients are some examples of data integration components.
Data access interfaces: Data access interfaces allow you to access underlying Hadoop data using different languages such as SQL, NoSQL, or APIs such as Rest and JAVA APIs, or using different data formats such as search data formats and streams. Sometimes, the interface that you use to access data from Hadoop is tightly coupled with underlying data processing engines. For example, if you're using SPARK SQL then it is bound to use the SPARK processing engine. Something similar is true in the case of the SEARCH interface, which is bound to use search engines such as SOLR or elastic search.
Data Processing Engines: Hadoop as a platform provides different processing engines to manipulate underlying data. These processing engines have different mechanisms to use system resources and have completely different SLA guarantees. For example, the MapReduce processing engine is more disk I/O-bound (keeping RAM memory usage under control) and it is suitable for batch-oriented data processing. Similarly, SPARK in a memory processing engine is less disk I/O-bound and more dependent on RAM memory. It is more suitable for stream or micro-batch processing. You should choose processing engines for your application based on the type of data sources you are dealing with along with SLAs you need to satisfy.
Resource management frameworks: Resource management frameworks expose abstract APIs to interact with underlying resource managers for task and job scheduling in Hadoop. These frameworks ensure there is a set of steps to follow for submitting jobs in Hadoop using designated resource managers such as YARN or MESOS. These frameworks help establish optimal performance by utilizing underlying resources systematically. Examples of such frameworks are Tez or Slider. Sometimes, data processing engines use these frameworks to interact with underlying resource managers or they have their own set of custom libraries to do so.
Task and resource management: Task and resource managment has one primary goal: sharing a large cluster of machines across different, simultaneously running applications in a cluster. There are two major resource managers in Hadoop: YARN and MESOS. Both are built with the same goal, but they use different scheduling or resource allocation mechanisms for jobs in Hadoop. For example, YARN is a Unix process while MESOS is Linux-container-based.
Data input/output: The data input/output layer is primarily responsible for different file formats, compression techniques, and data serialization for Hadoop storage.
Data Storage Medium: HDFS is the primary data storage medium used in Hadoop. It is a Java-based, high-performant distributed filesystem that is based on the underlying UNIX File System. In the next section, we will study Hadoop distributions along with their benefits.