You're reading from Programming MapReduce with Scalding A practical guide to designing, testing, and implementing complex MapReduce applications in Scala

Product type Paperback

Published in Jun 2014

ISBN-13 9781783287017

Length 148 pages

Edition 1st Edition

Languages

Scala

Tools

Hadoop

Concepts

Front End Web Development

Author (1):

Antonios Chalkiopoulos

Table of Contents (11) Chapters

Preface

1. Introduction to MapReduce

2. Get Ready for Scalding

3. Scalding by Example

4. Intermediate Examples

5. Scalding Design Patterns

6. Testing and TDD

7. Running Scalding in Production

8. Using External Data Stores

9. Matrix Calculations and Machine Learning

Index

The Hadoop platform

Hadoop can be used for many things. However, when you break it down to its core parts, its primary features are the Hadoop Distributed File System (HDFS) and MapReduce.

HDFS stores files that are written once and then read many times. It splits them into large blocks and distributes and replicates those blocks across a Hadoop cluster. Two services are involved with the filesystem. The first service, the NameNode, acts as a master: it keeps the directory tree of all files in the filesystem and tracks where the blocks of each file are kept across the cluster. The actual file data is stored on multiple DataNodes, the second service.
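The splitting and replication described above can be sketched in a few lines of Scala. This is only an illustration, not HDFS code: the `HdfsSketch` object, the `BlockPlacement` type, the node names, and the round-robin placement are all hypothetical, and the block size and replication factor used in the usage note are assumed values. Real HDFS placement is decided by the NameNode and takes rack topology into account.

```scala
// A toy model of how a file is cut into blocks and replicated across DataNodes.
// Everything here is hypothetical; it does not use or mimic the Hadoop API.
object HdfsSketch {

  // One block of a file and the DataNodes assumed to hold its replicas.
  final case class BlockPlacement(blockIndex: Int, sizeBytes: Long, dataNodes: Seq[String])

  // Number of blocks needed for a file of the given size (ceiling division).
  def blockCount(fileSize: Long, blockSize: Long): Int =
    ((fileSize + blockSize - 1) / blockSize).toInt

  // Assign each block to `replication` distinct nodes, round-robin.
  // The returned sequence plays the role of the NameNode's block map.
  def place(fileSize: Long, blockSize: Long, replication: Int, nodes: Seq[String]): Seq[BlockPlacement] = {
    require(blockSize > 0, "block size must be positive")
    require(replication > 0 && replication <= nodes.size, "need at least one node per replica")
    (0 until blockCount(fileSize, blockSize)).map { i =>
      // Every block is full-sized except possibly the last one.
      val size = math.min(blockSize, fileSize - i.toLong * blockSize)
      val replicas = (0 until replication).map(r => nodes((i + r) % nodes.size))
      BlockPlacement(i, size, replicas)
    }
  }
}
```

For example, `HdfsSketch.place(300L * 1024 * 1024, 128L * 1024 * 1024, 3, Seq("dn1", "dn2", "dn3", "dn4"))` models a 300 MB file stored with an assumed 128 MB block size and three replicas per block: it yields three blocks, the last one smaller than the others, each held by three different nodes. Losing any single node still leaves two copies of every block.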

MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm on a cluster. The most prominent trait of Hadoop is that it brings the processing to the data: MapReduce executes tasks as close as possible to the data, rather than moving the data to where the processing is performed. Two services are involved in job execution. A job is submitted to the JobTracker service, which first discovers the location of the data and then orchestrates the execution of the map and reduce tasks. The actual tasks are executed on multiple TaskTracker nodes.
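To make the programming model concrete, here is a minimal single-process sketch of a word count expressed as map, shuffle, and reduce phases over plain Scala collections. It is not Hadoop or Scalding code: the `MapReduceSketch` object and its function names are hypothetical, and there is no JobTracker, TaskTracker, or data locality involved. On a real cluster, each phase would run as many parallel tasks across the nodes.

```scala
// A single-process sketch of the MapReduce model using plain Scala collections.
// No Hadoop API is used; all names here are hypothetical.
object MapReduceSketch {

  // Map phase: turn one input record into zero or more (key, value) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\W+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: bring together all values emitted for the same key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2)) }

  // Reduce phase: collapse the values of one key into a single result.
  def reducePhase(key: String, values: Seq[Int]): (String, Int) =
    (key, values.sum)

  // Wire the three phases together for a word count.
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    val mapped = lines.flatMap(mapPhase)
    shuffle(mapped).map { case (key, values) => reducePhase(key, values) }
  }
}
```

Calling `MapReduceSketch.wordCount(Seq("the quick fox", "The lazy dog"))` counts `the` twice and every other word once. Because `mapPhase` looks at one line at a time and `reducePhase` at one key at a time, both can be run independently on different machines, which is what lets the framework schedule them next to the data.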

Hadoop automatically handles infrastructure failures such as network issues and node or disk failures. Overall, it provides a framework for distributed storage, through its distributed filesystem, and for the distributed execution of jobs. Moreover, the ZooKeeper service is used alongside it to maintain configuration and provide distributed synchronization.

Many projects surround Hadoop and complete the ecosystem of available Big Data processing tools, such as utilities to import and export data, NoSQL databases, and event/real-time processing systems. The technologies that move Hadoop beyond batch processing focus on in-memory execution models. Overall, multiple projects exist, covering batch, hybrid, and real-time execution.
