This chapter introduces Hadoop and its core components, namely the Hadoop Distributed File System (HDFS) and MapReduce. We start with the changes and new features in the Hadoop 3 release, in particular the new features of HDFS and Yet Another Resource Negotiator (YARN), and the changes to client applications. We also install a Hadoop cluster locally and demonstrate new features such as erasure coding (EC) and the timeline service. As a quick note, Chapter 10, Visualizing Big Data, shows you how to create a Hadoop cluster in AWS.
In a nutshell, the following topics will be covered throughout this chapter:
- HDFS
  - High availability
  - Intra-DataNode balancer
  - EC
  - Port mapping
- MapReduce
  - Task-level optimization
- YARN
  - Opportunistic containers
  - Timeline service v.2
  - Docker containerization
- Other changes
- Installation of Hadoop 3.1
  - HDFS
  - YARN
  - EC
  - Timeline service v.2