Mastering Hadoop 3


Product type Book
Published in Feb 2019
Publisher Packt
ISBN-13 9781788620444
Pages 544 pages
Edition 1st Edition
Authors (2): Chanchal Singh, Manish Kumar

Hadoop logical view


The Hadoop logical view can be divided into multiple sections. These sections can be viewed as a logical sequence, starting at ingress/egress and ending at the data storage medium.

The following diagram shows the Hadoop platform logical view:

We will go through the sections shown in the preceding diagram one by one. When designing any Hadoop application, you should think in terms of these sections and make technology choices according to the use case you are trying to solve. Let's look at each section in turn:

  • Ingress/egress/processing: Any interaction with the Hadoop platform should be viewed in terms of the following:
    • Ingesting (ingress) data
    • Reading (egress) data
    • Processing already ingested data

These actions can be automated using tools or custom code, or they can be performed directly by users who upload data to Hadoop or download data from it. Sometimes, user actions trigger further ingress/egress or processing of data.
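As a minimal sketch of user-driven ingress and egress, the helpers below build the `hdfs dfs -put` and `hdfs dfs -get` command lines for uploading and downloading a file. The function names and paths are hypothetical, and the sketch assumes the `hdfs` command-line client is installed and on the `PATH`. It only builds the commands; running them against a cluster is left to the caller.

```python
import subprocess
from typing import List


def ingress_command(local_path: str, hdfs_path: str) -> List[str]:
    """Build the command that uploads (ingests) a local file into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_path]


def egress_command(hdfs_path: str, local_path: str) -> List[str]:
    """Build the command that downloads (reads) a file out of HDFS."""
    return ["hdfs", "dfs", "-get", hdfs_path, local_path]


def run(command: List[str]) -> None:
    """Run a command, raising CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the two directions.
    print(" ".join(ingress_command("events.csv", "/data/raw/events.csv")))
    print(" ".join(egress_command("/data/raw/events.csv", "events_copy.csv")))
```

Wrapping the same commands in a scheduler or script is one way to move from a manual user action to the automated case described above.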

  • Data integration components: For ingress/egress or data processing in Hadoop, you need data integration components. These components are tools, software, or custom code that help integrate the underlying Hadoop data with user views or actions. From the user's perspective, they give end users a unified view of data in Hadoop, even though that data is spread across different distributed Hadoop folders, files, and data formats. They also give end users and applications an entry point for using or manipulating Hadoop data through different data access interfaces and data processing engines. We will explore the definition of data access interfaces and processing engines in the next section. In a nutshell, tools such as Hue and software (libraries) such as Sqoop, Java Hadoop clients, and Hive Beeline clients are some examples of data integration components.
  • Data access interfaces: Data access interfaces allow you to access underlying Hadoop data using query languages such as SQL or NoSQL, APIs such as REST and Java APIs, or data formats such as search data formats and streams. Sometimes, the interface that you use to access data from Hadoop is tightly coupled with the underlying data processing engine. For example, if you're using Spark SQL, then it is bound to use the Spark processing engine. Something similar is true of the search interface, which is bound to use search engines such as Solr or Elasticsearch.
  • Data processing engines: Hadoop as a platform provides different processing engines to manipulate underlying data. These processing engines use system resources in different ways and offer very different SLA guarantees. For example, the MapReduce processing engine is more disk I/O-bound (keeping memory usage under control) and is suitable for batch-oriented data processing. Spark, an in-memory processing engine, is less disk I/O-bound and more dependent on RAM, which makes it more suitable for stream or micro-batch processing. You should choose processing engines for your application based on the type of data sources you are dealing with and the SLAs you need to satisfy.
  • Resource management frameworks: Resource management frameworks expose abstract APIs for interacting with the underlying resource managers for task and job scheduling in Hadoop. These frameworks define a set of steps to follow when submitting jobs in Hadoop through designated resource managers such as YARN or Mesos, and they help achieve good performance by using the underlying resources systematically. Examples of such frameworks are Tez and Slider. Data processing engines sometimes use these frameworks to interact with the underlying resource managers, and sometimes have their own custom libraries to do so.
  • Task and resource management: Task and resource management has one primary goal: sharing a large cluster of machines across different, simultaneously running applications. There are two major resource managers in Hadoop: YARN and Mesos. Both are built with the same goal, but they use different scheduling and resource allocation mechanisms for jobs in Hadoop. For example, YARN is based on Unix processes, while Mesos is based on Linux containers.
  • Data input/output: The data input/output layer is primarily responsible for different file formats, compression techniques, and data serialization for Hadoop storage.
  • Data storage medium: HDFS is the primary data storage medium used in Hadoop. It is a Java-based, high-performance distributed filesystem that runs on top of the underlying UNIX filesystem of each node.

In the next section, we will study Hadoop distributions along with their benefits.
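To make the advice about thinking in terms of these sections more concrete, the following sketch records a technology choice for each layer and checks the design for gaps. It is an illustrative checklist only, not a Hadoop API: the layer names come from the list above, while `LAYERS`, `INTERFACE_REQUIRES_ENGINE`, `missing_layers`, `coupling_problems`, and the sample design are hypothetical names introduced here. The only coupling rule encoded is the one mentioned in the text, namely that Spark SQL is bound to the Spark engine.

```python
from typing import Dict, List

# Layers of the logical view beneath the ingress/egress/processing
# interactions, in the order the text presents them.
LAYERS = [
    "data integration components",
    "data access interfaces",
    "data processing engines",
    "resource management frameworks",
    "task and resource management",
    "data input/output",
    "data storage medium",
]

# Interfaces that are tied to one engine (only the example from the text).
INTERFACE_REQUIRES_ENGINE = {"Spark SQL": "Spark"}


def missing_layers(design: Dict[str, List[str]]) -> List[str]:
    """Return the layers for which the design names no technology."""
    return [layer for layer in LAYERS if not design.get(layer)]


def coupling_problems(design: Dict[str, List[str]]) -> List[str]:
    """Return a message for each interface whose required engine is absent."""
    engines = set(design.get("data processing engines", []))
    problems = []
    for interface in design.get("data access interfaces", []):
        required = INTERFACE_REQUIRES_ENGINE.get(interface)
        if required and required not in engines:
            problems.append(f"{interface} requires the {required} engine")
    return problems


# A deliberately incomplete, hypothetical design used to exercise the checks.
design = {
    "data integration components": ["Hue", "Sqoop"],
    "data access interfaces": ["Spark SQL"],
    "data processing engines": ["MapReduce"],
    "resource management frameworks": ["Tez"],
    "task and resource management": ["YARN"],
    "data storage medium": ["HDFS"],
}

if __name__ == "__main__":
    print(missing_layers(design))
    print(coupling_problems(design))
```

For the sample design, the checks report that no data input/output choice has been made and that Spark SQL was selected without the Spark engine, which are the kinds of gaps worth catching before implementation starts.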

 

 
