Hadoop 2.x Administration Cookbook: Administer and maintain large Apache Hadoop clusters

Chapter 1. Hadoop Architecture and Deployment

In this chapter, we will cover the following recipes:

  • Overview of Hadoop Architecture
  • Building and compiling Hadoop
  • Installation methods
  • Setting up host resolution
  • Installing a single-node cluster - HDFS components
  • Installing a single-node cluster - YARN components
  • Installing a multi-node cluster
  • Configuring the Hadoop Gateway node
  • Decommissioning nodes
  • Adding nodes to the cluster

Introduction

As Hadoop is a distributed system with many components and has a reputation for being quite complex, it is important to understand the basic architecture before we start with the deployments.

In this chapter, we will take a look at the architecture and at recipes to deploy a Hadoop cluster in various modes. This chapter will also cover recipes for commissioning and decommissioning nodes in a cluster.

The recipes in this chapter will primarily focus on deploying a cluster based on an Apache Hadoop distribution, as it is the best way to learn and explore Hadoop.

Note

While the recipes in this chapter will give you an overview of a typical configuration, we encourage you to adapt this design according to your needs. The deployment directory structure varies according to IT policies within an organization. All our deployments will be based on the Linux operating system, as it is the most commonly used platform for Hadoop in production. You can use any flavor of Linux; the recipes are very generic in nature and should work on all Linux flavors, with the appropriate changes in path and installation methods, such as yum or apt-get.

Overview of Hadoop Architecture

Hadoop is a framework and not a tool. It is a combination of various components, such as a filesystem, processing engine, data ingestion tools, databases, workflow execution tools, and so on. Hadoop is based on a client-server architecture, with a master node for each of the storage and processing layers.

Namenode is the master for Hadoop Distributed File System (HDFS) storage and ResourceManager is the master for YARN (Yet Another Resource Negotiator) processing. The Namenode stores the file metadata, while the actual blocks/data reside on the slave nodes called Datanodes. All jobs are submitted to the ResourceManager, which then assigns tasks to its slaves, called NodeManagers. In a highly available cluster, we can have more than one Namenode and ResourceManager.

Each of these masters is a single point of failure, which makes them very critical components of the cluster, so care must be taken to make them highly available.

Although there are many concepts to learn, such as application masters, containers, schedulers, and so on, as this is a recipe book, we will keep the theory to a minimum.

Building and compiling Hadoop

The pre-built Hadoop binary available at www.apache.org is a 32-bit version and is not suitable for 64-bit hardware, as it will not be able to utilize the entire addressable memory. Although we can use the 32-bit version for lab purposes, it will keep printing warnings about the native library not being built for the platform, which can be safely ignored.

In production, we will always be running Hadoop on 64-bit hardware that can support larger amounts of memory. To properly utilize more than 4 GB of memory on any node, we need the 64-bit compiled version of Hadoop.

Getting ready

To step through the recipes in this chapter, or indeed the entire book, you will need at least one preinstalled Linux instance. You can use any distribution of Linux, such as Ubuntu, CentOS, or any other flavor you are comfortable with. The recipes are very generic and are expected to work with all distributions, although, as stated before, you may need to use distro-specific commands. For example, for package installation on CentOS we use the yum package installer, whereas on Debian-based systems we use apt-get, and so on. You are expected to know basic Linux commands and should know how to set up package repositories such as a yum repository. You should also know how DNS resolution is configured. No other prerequisites are required.

How to do it...

  1. ssh to the Linux instance using any ssh client. If you are on Windows, you can use PuTTY; on a Mac or Linux, the default terminal provides ssh. The following command connects to the host with an IP of 10.0.0.4. Change it to whatever the IP is in your case:
    $ ssh root@10.0.0.4
    
  2. Change to the user root or any other privileged user:
    $ sudo su -
    
  3. Install the dependencies needed to build Hadoop. JDK 1.7 (update 45 or later) must also be installed:
    # yum install gcc gcc-c++ openssl-devel make cmake
    
  4. Download Maven:
    # wget http://mirrors.gigenet.com/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
    
  5. Untar Maven:
    # tar -zxf apache-maven-3.3.9-bin.tar.gz -C /opt/
    
  6. Set up the Maven environment:
    # cat /etc/profile.d/maven.sh
    export JAVA_HOME=/usr/java/latest
    export M3_HOME=/opt/apache-maven-3.3.9
    export PATH=$JAVA_HOME/bin:/opt/apache-maven-3.3.9/bin:$PATH
    
  7. Download and set up protobuf:
    # wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
    # tar -xzf protobuf-2.5.0.tar.gz -C /opt/
    # cd /opt/protobuf-2.5.0/
    # ./configure
    # make && make install
    
  8. Download the latest Hadoop stable source code. At the time of writing, the latest Hadoop version is 2.7.3:
    # wget http://apache.uberglobalmirror.com/hadoop/common/stable2/hadoop-2.7.3-src.tar.gz
    # tar -xzf hadoop-2.7.3-src.tar.gz -C /opt/
    # cd /opt/hadoop-2.7.3-src
    # mvn package -Pdist,native -DskipTests -Dtar
    
  9. You will see a tarball in the folder hadoop-2.7.3-src/hadoop-dist/target/.
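Once the build finishes, it is worth confirming that the libraries really were compiled as 64-bit. A minimal check, assuming the freshly built tarball is extracted under /opt (paths are illustrative):

    # tar -xzf /opt/hadoop-2.7.3-src/hadoop-dist/target/hadoop-2.7.3.tar.gz -C /opt/
    # file /opt/hadoop-2.7.3/lib/native/libhadoop.so.1.0.0

The file command should report an ELF 64-bit shared object; once the environment is set up later in this chapter, hadoop checknative -a gives a more detailed report of the native libraries in use.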

How it works...

The tarball package created will be used for the installation of Hadoop throughout the book. It is not mandatory to build Hadoop from source, but by default the binary packages provided by Apache Hadoop are 32-bit versions. For production, it is important to use a 64-bit version so as to fully utilize memory beyond 4 GB and to gain other performance benefits.

Installation methods

Hadoop can be installed in multiple ways, either by using repository methods such as yum/apt-get or by extracting the tarball packages. The Apache Bigtop project (http://bigtop.apache.org/) provides Hadoop infrastructure packages, which can be used by creating a local repository of the packages.

All the steps are to be performed as the root user. It is expected that the user knows how to set up a yum repository and Linux basics.

Getting ready

You are going to need a Linux machine. You can either use the one that was used in the previous recipe or set up a new node, which will act as a repository server and host all the packages we need.

How to do it...

  1. Connect to a Linux machine that has at least 5 GB disk space to store the packages.
  2. If you are on CentOS or a similar distribution, make sure you have the package yum-utils installed. This package will provide the command reposync.
  3. Create a file bigtop.repo under /etc/yum.repos.d/. Note that the file name can be anything—only the extension must be .repo.
  4. Populate the file with the Bigtop repository definition; a sketch of typical contents is shown after this list.
  5. Execute the command reposync -r bigtop. It will create a directory named bigtop under the present working directory, with all the packages downloaded to it.
  6. All the required Hadoop packages can be installed by configuring the repository we downloaded as a repository server.
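The contents of the bigtop.repo file are shown only as a screenshot in the original, so the following is a rough sketch of what such a repository definition typically looks like. The baseurl is an assumption and must be replaced with the Bigtop repository URL that matches your distribution, architecture, and Bigtop version:

    # cat /etc/yum.repos.d/bigtop.repo
    [bigtop]
    name=Apache Bigtop
    baseurl=http://archive.apache.org/dist/bigtop/bigtop-1.1.0/repos/centos6/x86_64/
    gpgcheck=0
    enabled=1

The repository id in brackets, bigtop, is the name passed to reposync -r in step 5.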

How it works...

From step 2 to step 6, the user configures and uses the Hadoop package repository. Setting up a yum repository is not required, but it makes things easier if we have to perform installations on hundreds of nodes. In larger setups, configuration management systems such as Puppet or Chef are used to push configuration and packages to the nodes.

In this chapter, we will be using the tarball package that was built in the first section to perform installations. This is the best way of learning about directory structure and the configurations needed.

Setting up host resolution

Before we start with the installations, it is important to make sure that the host resolution is configured and working properly.

Getting ready

Choose appropriate hostnames for the Linux machines, for example, master1.cluster.com, rt1.cyrus.com, or host1.example.com. The important thing is that the hostnames must resolve.

This resolution can be done using a DNS server or by configuring the /etc/hosts file on each node we use for our cluster setup.

The following steps will show you how to set up the resolution in the /etc/hosts file.

How to do it...

  1. Connect to the Linux machine and change the hostname to master1.cyrus.com in the appropriate configuration file for your distribution; a sketch is shown after this list.
  2. Edit the /etc/hosts file so that the hostname maps to the node's IP address, as sketched after this list.
  3. Make sure the resolution returns an IP address:
    # getent hosts master1.cyrus.com
    
  4. The other, preferred method is to set up DNS resolution so that we do not have to populate the hosts file on each node. In the example resolution shown here, you can see that the DNS server is configured to answer queries for the domain cyrus.com:
    # nslookup master1.cyrus.com
    Server:		10.0.0.2
    Address:	10.0.0.2#53
    
    Non-authoritative answer:
    Name:	master1.cyrus.com
    Address: 10.0.0.104
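The hostname and hosts file screenshots from steps 1 and 2 are not reproduced here. A minimal sketch, assuming a CentOS 6 machine (on systemd-based distributions, hostnamectl set-hostname master1.cyrus.com serves the same purpose) and the example IP address used above:

    # cat /etc/sysconfig/network
    NETWORKING=yes
    HOSTNAME=master1.cyrus.com

    # cat /etc/hosts
    127.0.0.1   localhost
    10.0.0.104  master1.cyrus.com master1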
    

How it works...

Each Linux host has a resolver library that helps it resolve any hostname that is asked for. It consults the hosts file and the DNS server, in the order defined in /etc/nsswitch.conf. Users who are not Linux administrators can simply use the hosts file as a workaround to set up a Hadoop cluster. There are many resources available online that can help you set up a DNS server quickly if needed.

Once the resolution is in place, we will start with the installation of Hadoop on a single node and then progress to multiple nodes.

Installing a single-node cluster - HDFS components

Usually the term cluster means a group of machines, but in this recipe we will be installing the various Hadoop daemons on a single node. This single machine will act as both the master and slave for the storage and processing layers.

Getting ready

You will need some information before stepping through this recipe.

Although Hadoop can be configured to run as the root user, it is good practice to run it as a non-privileged user. In this recipe, we are using the node nn1.cluster1.com, preinstalled with CentOS 6.5.

Tip

Create a system user named hadoop and set a password for that user.

Install the JDK, which will be used by the Hadoop services. The minimum recommended version is JDK 1.7; OpenJDK can also be used.

How to do it...

  1. Log in to the machine/host as the root user and install the JDK:
    # yum install jdk -y
    Alternatively, it can be installed from an RPM package:
    # rpm -ivh jdk-1.7u45.rpm
    
  2. Once Java is installed, make sure it is in the PATH for execution. This can be done by setting JAVA_HOME and exporting it as an environment variable:
    # export JAVA_HOME=/usr/java/latest
  3. Now we need to copy the tarball hadoop-2.7.3.tar.gz, which was built in the Building and compiling Hadoop recipe earlier in this chapter, to the home directory of the root user. For this, log in to the node where Hadoop was built and execute the following command:
    # scp hadoop-2.7.3.tar.gz root@nn1.cluster1.com:~/
    
  4. Create a directory named /opt/cluster to be used for Hadoop:
    # mkdir -p /opt/cluster
    
  5. Then untar hadoop-2.7.3.tar.gz into the directory created in the preceding step:
    # tar -xzvf hadoop-2.7.3.tar.gz -C /opt/cluster/
    
  6. Create a user named hadoop, if you haven't already, and set the password as hadoop:
    # useradd hadoop
    # echo hadoop | passwd --stdin hadoop
    
  7. As the preceding steps were performed by the root user, the directories and files under /opt/cluster will be owned by the root user. Change the ownership to the hadoop user:
    # chown -R hadoop:hadoop /opt/cluster/
    
  8. If you list the directory structure under /opt/cluster, you will see the extracted hadoop-2.7.3 directory.
  9. Take a look at the directory structure under /opt/cluster/hadoop-2.7.3.
  10. The listing shows etc, bin, sbin, and other directories.
  11. The etc/hadoop directory is the one that contains the configuration files for the various Hadoop daemons. Some of the key files are core-site.xml, hdfs-site.xml, hadoop-env.sh, and mapred-site.xml, among others, which will be explained in later sections.
  12. The directories bin and sbin contain executable binaries, which are used to start and stop Hadoop daemons and to perform other operations such as filesystem listing, copying, deleting, and so on.
  13. To execute a command such as /opt/cluster/hadoop-2.7.3/bin/hadoop, the complete path to the command needs to be specified. This is cumbersome, and can be avoided by setting the HADOOP_HOME environment variable and adding the bin directory to the PATH.
  14. Similarly, there are other variables that need to be set that point to the binaries and the configuration file locations; these all go into the file /etc/profile.d/hadoopenv.sh (a sketch is shown after this list).
  15. The environment file is set up system-wide so that any user can use the commands. Once the hadoopenv.sh file is in place, source it with . /etc/profile.d/hadoopenv.sh so that the variables defined in it are exported.
  16. Change to the hadoop user using the command su - hadoop.
  17. Change to the /opt/cluster directory and create a symlink with ln -s hadoop-2.7.3 hadoop.
  18. To verify that the preceding changes are in place, execute either the which hadoop or which java command, or execute the command hadoop directly without specifying the complete path.
  19. In addition to setting the environment as discussed, the user has to add the JAVA_HOME variable in the hadoop-env.sh file.
  20. The next thing is to set up the Namenode address, which specifies the host:port on which it will listen. This is done in the file core-site.xml (see the sketch after this list).
  21. The important thing to keep in mind is the property fs.defaultFS.
  22. The next thing to configure is the location where the Namenode will store its metadata. This can be any location, but it is recommended that you always have a dedicated disk for it. It is configured with the dfs.namenode.name.dir property in the file hdfs-site.xml.
  23. The next step is to format the Namenode. This will create an HDFS file system:
    $ hdfs namenode -format
    
  24. Similarly, we have to add the property for the Datanode data directory, dfs.datanode.data.dir, under hdfs-site.xml. Nothing needs to be changed in the core-site.xml file.
  25. Then the services need to be started for Namenode and Datanode:
    $ hadoop-daemon.sh start namenode
    $ hadoop-daemon.sh start datanode
    
  26. The command jps can be used to check that the Namenode and Datanode daemons are running.
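The environment file and the two configuration files are shown as screenshots in the original, so the following is a minimal sketch of what they might contain for this single-node setup. The directory paths, the nn1.cluster1.com hostname, and the port 9000 are assumptions; adapt them to your environment:

    # cat /etc/profile.d/hadoopenv.sh
    export JAVA_HOME=/usr/java/latest
    export HADOOP_HOME=/opt/cluster/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export YARN_HOME=$HADOOP_HOME
    export YARN_CONF_DIR=$HADOOP_CONF_DIR
    export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

    # cat /opt/cluster/hadoop/etc/hadoop/core-site.xml
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://nn1.cluster1.com:9000</value>
      </property>
    </configuration>

    # cat /opt/cluster/hadoop/etc/hadoop/hdfs-site.xml
    <configuration>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/space/dfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>/space/dfs/datanode</value>
      </property>
    </configuration>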

How it works...

The master Namenode stores the metadata, and the slave Datanodes store the blocks. When the Namenode is formatted, it creates a directory structure that contains the fsimage, edits, and VERSION files. These are very important for the functioning of the cluster.

The parameters dfs.data.dir and dfs.datanode.data.dir serve the same purpose, but belong to different versions. The older parameters are deprecated in favor of the newer ones, but they will still work. Similarly, the parameter dfs.name.dir has been deprecated in favor of dfs.namenode.name.dir in Hadoop 2.x. The intention of showing both versions of the parameters is to bring to your notice that parameters are evolving and ever changing, and care must be taken by referring to the release notes for each Hadoop version.

There's more...

Setting up ResourceManager and NodeManager

In the preceding recipe, we set up the storage layer, that is, HDFS for storing data, but what about the processing layer? The data on HDFS needs to be processed to make meaningful decisions using MapReduce, Tez, Spark, or any other tool. To run MapReduce, Spark, or another processing framework, we need ResourceManager and NodeManager, which are covered in the next recipe.

Installing a single-node cluster - YARN components

In the previous recipe, we discussed how to set up Namenode and Datanode for HDFS. In this recipe, we will be covering how to set up YARN on the same node.

After completing this recipe, there will be four daemons running on the nn1.cluster1.com node, namely the Namenode, Datanode, ResourceManager, and NodeManager daemons.

Getting ready

For this recipe, you will again use the same node on which we have already configured the HDFS layer.

All operations will be done by the hadoop user.

How to do it...

  1. Log in to the node nn1.cluster1.com and change to the hadoop user.
  2. Change to the /opt/cluster/hadoop/etc/hadoop directory and configure the files mapred-site.xml and yarn-site.xml (a sketch is shown after this list).
  3. The file yarn-site.xml specifies the shuffle class, scheduler, and resource management components of the ResourceManager. You only need to specify yarn.resourcemanager.address; the rest are picked up automatically. If required, the ResourceManager's components, such as the scheduler and resource tracker, can each be given their own address.
  4. Once the configurations are in place, the resourcemanager and nodemanager daemons need to be started:
    $ yarn-daemon.sh start resourcemanager
    $ yarn-daemon.sh start nodemanager
  5. The environment variables that were defined by /etc/profile.d/hadoopenv.sh included YARN_HOME and YARN_CONF_DIR, which let the framework know about the location of the YARN configurations.
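The mapred-site.xml and yarn-site.xml screenshots are not reproduced here, so the following is a rough sketch of a minimal configuration for this single-node setup. The hostname and the port 8032 are assumptions:

    # cat /opt/cluster/hadoop/etc/hadoop/mapred-site.xml
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>

    # cat /opt/cluster/hadoop/etc/hadoop/yarn-site.xml
    <configuration>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>nn1.cluster1.com:8032</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>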

How it works...

The nn1.cluster1.com node is configured to run the HDFS and YARN components. Any file that is copied to HDFS will be split into blocks and stored on the Datanode. The metadata of the file will be kept on the Namenode.

Any operation performed on a text file, such as word count, can be done by running a simple MapReduce program, which will be submitted to the single node cluster using the ResourceManager daemon and executed by the NodeManager. There are a lot of steps and details as to what goes on under the hood, which will be covered in the coming chapters.

Note

The single-node cluster is also called a pseudo-distributed cluster.

There's more...

A quick check can be done on the functionality of HDFS. You can create a simple text file and upload it to HDFS to see whether it is successful or not:

$ hadoop fs -put test.txt /

This will copy the file test.txt to the HDFS. The file can be read directly from HDFS:

$ hadoop fs -ls /
$ hadoop fs -cat /test.txt
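To exercise the YARN layer as well, the same file can be run through the word count example that ships with the distribution. This is a sketch; the examples jar file name depends on the exact Hadoop version, and the output directory is arbitrary:

$ yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /test.txt /wcout
$ hadoop fs -cat /wcout/part-r-00000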

See also

  • The Installing a multi-node cluster recipe

Installing a multi-node cluster

In the previous recipes, we looked at how to configure a single-node Hadoop cluster, also referred to as a pseudo-distributed cluster. In this recipe, we will set up a fully distributed cluster, with each daemon running on separate nodes.

There will be one node for Namenode, one for ResourceManager, and four nodes will be used for Datanode and NodeManager. In production, the number of Datanodes could be in the thousands, but here we are just restricted to four nodes. The Datanode and NodeManager coexist on the same nodes for the purposes of data locality and locality of reference.

Getting ready

Make sure that the six nodes the user chooses have JDK installed, with name resolution working. This could be done by making entries in the /etc/hosts file or using DNS.

In this recipe, we are using the following nodes:

  • Namenode: nn1.cluster1.com
  • ResourceManager: jt1.cluster1.com
  • Datanodes and NodeManager: dn[1-4].cluster1.com

How to do it...

  1. Make sure all the nodes have the hadoop user.
  2. Create the directory structure /opt/cluster on all the nodes.
  3. Make sure the ownership is correct for /opt/cluster.
  4. Copy the /opt/cluster/hadoop-2.7.3 directory from nn1.cluster1.com to all the nodes in the cluster:
    $ for i in 192.168.1.{72..75}; do scp -r hadoop-2.7.3 $i:/opt/cluster/; done
    
  5. The preceding IPs belong to the nodes in the cluster; modify them accordingly. Also, to prevent scp from prompting for a password for each node, it is good to set up passphraseless SSH access between the nodes.
  6. Change to the directory /opt/cluster and create a symbolic link on each node:
    $ ln -s hadoop-2.7.3 hadoop
    
  7. Make sure that the environment variables have been set up on all nodes:
    $ . /etc/profile.d/hadoopenv.sh
    
  8. On Namenode, only the parameters specific to it are needed.
  9. The file core-site.xml remains the same across all nodes in the cluster.
  10. On the Namenode, the file hdfs-site.xml changes to carry only the Namenode-specific properties; a sketch of the per-role files is given after this list.
  11. On the Datanodes, the file hdfs-site.xml changes to carry the Datanode-specific properties.
  12. On the Datanodes, the file yarn-site.xml points the NodeManagers at the ResourceManager.
  13. On the node jt1, which runs the ResourceManager, the file yarn-site.xml carries the ResourceManager settings.
  14. To start Namenode on nn1.cluster1.com, enter the following:
    $ hadoop-daemon.sh start namenode
    
  15. To start Datanode and NodeManager on dn[1-4], enter the following:
    $ hadoop-daemon.sh start datanode
    $ yarn-daemon.sh start nodemanager
    
  16. To start ResourceManager on jt1.cluster1.com, enter the following:
    $ yarn-daemon.sh start resourcemanager
    
  17. On each node, execute the command jps to see the daemons running on them. Make sure you have the correct services running on each node.
  18. Create a text file test.txt and copy it to HDFS using hadoop fs -put test.txt /. This confirms that HDFS is working fine.
  19. To verify that YARN has been set up correctly, run the simple pi estimation program from the examples jar (the jar file name depends on the Hadoop version):
    $ yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 3 3
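The per-role configuration files from steps 10 to 13 are shown as screenshots in the original. The following sketch captures the gist of them; the /space/dfs directory paths are assumptions, while the jt1.cluster1.com hostname follows the node naming used in this recipe:

    <!-- On nn1 (Namenode): hdfs-site.xml carries the metadata directory -->
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/space/dfs/namenode</value>
    </property>

    <!-- On dn1 to dn4 (Datanodes): hdfs-site.xml carries the block storage directory instead -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/space/dfs/datanode</value>
    </property>

    <!-- On jt1 and on the Datanodes: yarn-site.xml points every node at the ResourceManager -->
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>jt1.cluster1.com</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>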
    

How it works...

Steps 1 through 7 copy the already extracted and configured Hadoop files to other nodes in the cluster. From step 8 onwards, each node is configured according to the role it plays in the cluster.

The user should see four Datanodes reporting to the cluster, and should also be able to access the Namenode web UI on port 50070 and the ResourceManager web UI on port 8088.

To see the number of nodes talking to Namenode, enter the following:

$ hdfs dfsadmin -report
  Configured Capacity: 9124708352 (21.50 GB)
  Present Capacity: 5923942400 (20.52 GB)
  DFS Remaining: 5923938304 (20.52 GB)
  DFS Used: 4096 (4 KB)
  DFS Used%: 0.00%
Live datanodes (4):

The same information can also be retrieved using the Namenode web UI.

Note

The user can configure any custom port for any service, but there should be a good reason to change the defaults.

Configuring the Hadoop Gateway node

A Hadoop Gateway, or edge node, is a node that connects to the Hadoop cluster but does not run any of the daemons. The purpose of an edge node is to provide an access point to the cluster and prevent users from connecting directly to critical components such as the Namenode or Datanodes.

Another important reason for its use is data distribution across the cluster. If a user connects to a Datanode and performs the copy operation hadoop fs -put file /, then one replica of every block will always be placed on the Datanode from which the copy command was executed. This results in an imbalance of data across the nodes. If we upload a file from a node that is not a Datanode, the data will be distributed evenly across the cluster.
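If data does become unevenly distributed, the HDFS balancer can be run to redistribute blocks across the Datanodes. A typical invocation, where the threshold is the acceptable percentage deviation from the average utilization (the value 10 is illustrative):

$ hdfs balancer -threshold 10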

In this recipe, we will configure an edge node for a Hadoop cluster.

Getting ready

For the edge node, the user needs a separate Linux machine with Java installed and the user hadoop in place.

How to do it...

  1. ssh to the new node that is to be configured as Gateway node. For example, the node name could be client1.cluster1.com.
  2. Set up the environment variables as discussed before. This can be done by setting up the /etc/profile.d/hadoopenv.sh file.
  3. Copy the already configured directory hadoop-2.7.3 from Namenode to this node (client1.cluster1.com). This avoids doing all the configuration for files such as core-site.xml and yarn-site.xml.
  4. The edge node just needs to know about the two master nodes, the Namenode and the ResourceManager. It does not need any other configuration for the time being, and it does not store any data locally, unlike the Namenode and Datanodes. A sketch of the minimal client configuration is shown after this list.
  5. It only needs to write temporary files and logs. In later chapters, we will see other parameters for MapReduce and performance tuning that go on this node.
  6. Create a symbolic link with ln -s hadoop-2.7.3 hadoop so that the commands and Hadoop configuration files are visible.
  7. There will be no daemons started on this node. Execute a command such as hadoop fs -ls / from the edge node to make sure it can connect to the cluster.
  8. To verify that the edge node has been set up correctly, run the simple pi estimation program from the edge node (the examples jar file name depends on the Hadoop version):
    $ yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 3 3
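As a minimal sketch of the client-side configuration mentioned in step 4, the edge node's core-site.xml and yarn-site.xml only need to point at the two masters. The hostnames and the port shown are the ones assumed in the earlier recipes:

    # cat /opt/cluster/hadoop/etc/hadoop/core-site.xml
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://nn1.cluster1.com:9000</value>
      </property>
    </configuration>

    # cat /opt/cluster/hadoop/etc/hadoop/yarn-site.xml
    <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>jt1.cluster1.com</value>
      </property>
    </configuration>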
    

How it works...

The edge node, or Gateway node, connects to the Namenode for all HDFS-related operations and to the ResourceManager for submitting jobs to the cluster.

In production, there will be more than one edge node connecting to the cluster for high availability. This can be done by using a load balancer or DNS round-robin. No user should run local jobs on the edge nodes or use them for non-Hadoop-related tasks.

See also

The edge node can also be used to host many additional components, such as Pig, Hive, and Sqoop, rather than installing them on the main cluster nodes such as the Namenode and Datanodes. This way, it is easy to segregate the complexity and restrict access to just the edge node.

  • The Configuring Hive recipe

Decommissioning nodes

There will always be failures in clusters, such as hardware issues or the need to upgrade nodes. Removing a node should be done in a graceful manner, without any data loss.

When the Datanode daemon is stopped on a node, it takes approximately ten minutes for the Namenode to mark that node as dead. This is governed by the heartbeat and recheck intervals. We can abruptly remove a Datanode at any time, but that can result in data loss.
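As a rough sketch of where the ten minutes come from, assuming the stock Hadoop 2.x defaults for the two hdfs-site.xml properties involved:

    dfs.heartbeat.interval                  = 3 seconds (default)
    dfs.namenode.heartbeat.recheck-interval = 300000 milliseconds (default, 5 minutes)

    dead-node timeout = 2 * recheck-interval + 10 * heartbeat-interval
                      = 2 * 300 s + 10 * 3 s = 630 s (roughly 10.5 minutes)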

It is recommended that you opt for the graceful removal of the node from the cluster, as this ensures that all the data on that node is drained.

Getting ready

For the following steps, we assume that the cluster is up and running with its Datanodes in a healthy state, and that the Datanode dn1.cluster1.com needs maintenance and must be removed from the cluster. We will log in to the Namenode and make the changes there.

How to do it...

  1. ssh to Namenode and edit the file hdfs-site.xml by adding the following property to it:
    <property>
        <name>dfs.hosts.exclude</name>
        <value>/home/hadoop/excludes</value>
        <final>true</final>
    </property>
  2. Make sure the file excludes is readable by the user hadoop.
  3. Restart the Namenode daemon for the property to take effect:
    $ hadoop-daemon.sh stop namenode
    $ hadoop-daemon.sh start namenode
    
  4. A restart of the Namenode is required only when a property is changed in hdfs-site.xml. Once the property is in place, the Namenode can pick up changes to the contents of the excludes file by simply refreshing the node list.
  5. Add the dn1.cluster1.com node to the file excludes:
    $ cat excludes
    dn1.cluster1.com
    
  6. After adding the node to the file, we just need to reload the file by doing the following:
    $ hadoop dfsadmin -refreshNodes
    
  7. After some time, the node will be decommissioned. The time will vary according to the amount of data the particular Datanode held. We can see the decommissioned nodes using the following:
    $ hdfs dfsadmin -report
    
  8. The preceding command will list the nodes in the cluster, and against the dn1.cluster1.com node we can see that its status will either be decommissioning or decommissioned.

How it works...

Let's have a look at what we did throughout this recipe.

In steps 1 through 6, we added the new property to the hdfs-site.xml file and then restarted the Namenode to make it aware of the change. Once the property is in place, the Namenode knows about the excludes file, and it can be asked to re-read the file by simply refreshing the node list, as done in step 6.

With these steps, the data on the Datanode dn1.cluster1.com will be moved to other nodes in the cluster, and once the data has been drained, the Datanode daemon on the node will be shut down. During the process, the node's status will change from normal to decommissioning and then to decommissioned.

Care must be taken while decommissioning nodes in the cluster. The user should not decommission multiple nodes at a time, as this will generate a lot of network traffic and can cause congestion and data loss.

See also

  • The Adding nodes to the cluster recipe

Adding nodes to the cluster

Over a period of time, our cluster will grow in data and there will be a need to increase the capacity of the cluster by adding more nodes.

We can add Datanodes to the cluster in the same way that we first configured a Datanode and started the Datanode daemon on it. The important thing to keep in mind is that not every node should be able to become part of the cluster. It should not be possible for anyone to simply start a Datanode daemon on, say, a laptop and join the cluster, as that would be disastrous. By default, there is nothing preventing any node from becoming a Datanode, as a user merely has to untar the Hadoop package, point the core-site.xml file at the Namenode, and start the Datanode daemon.

Getting ready

For the following steps, we assume that the cluster is up and running with its Datanodes in a healthy state, and that we need to add a new Datanode to the cluster. We will log in to the Namenode and make the changes there.

How to do it...

  1. ssh to Namenode and edit the file hdfs-site.xml to add the following property to it:
    <property>
        <name>dfs.hosts</name>
        <value>/home/hadoop/includes</value>
        <final>true</final>
    </property>
  2. Make sure the file includes is readable by the user hadoop.
  3. Restart the Namenode daemon for the property to take effect:
    $ hadoop-daemon.sh stop namenode
    $ hadoop-daemon.sh start namenode
    
  4. A restart of the Namenode is required only when a property is changed in hdfs-site.xml. Once the property is in place, the Namenode can pick up changes to the contents of the includes file by simply refreshing the node list.
  5. Add the dn1.cluster1.com node to the file includes:
    $ cat includes
    dn1.cluster1.com
    
  6. The file includes or excludes can contain a list of multiple nodes, one node per line.
  7. After adding the node to the file, we just need to reload the file by entering the following:
    $ hadoop dfsadmin -refreshNodes
    
  8. After some time, the node will be available in the cluster and can be seen:
    $ hdfs dfsadmin -report
    

How it works...

The file /home/hadoop/includes will contain a list of all the Datanodes that are allowed to join the cluster. If the includes file is blank, then all Datanodes are allowed to join the cluster. If there are both an includes and an excludes file, the lists of nodes in the two files must be mutually exclusive. So, to decommission the node dn1.cluster1.com, it must be removed from the includes file and added to the excludes file.

There's more...

In addition to controlling which nodes can join, as described here, there will usually be firewall rules and separate VLANs in place for Hadoop clusters to keep the traffic and data isolated.


Key benefits

  • Become an expert Hadoop administrator and perform tasks to optimize your Hadoop Cluster
  • Import and export data into Hive and use Oozie to manage workflows
  • Practical recipes will help you plan and secure your Hadoop cluster, and make it highly available

Description

Hadoop enables the distributed storage and processing of large datasets across clusters of computers. Learning how to administer Hadoop is crucial to exploiting its unique features. With this book, you will be able to overcome common problems encountered in Hadoop administration. The book begins by laying the foundation, showing you the steps needed to set up a Hadoop cluster and its various nodes. You will get a better understanding of how to maintain a Hadoop cluster, especially on the HDFS layer and when using YARN and MapReduce. Further on, you will explore the durability and high availability of a Hadoop cluster. You'll get a better understanding of the schedulers in Hadoop and how to configure and use them for your tasks. You will also get hands-on experience with the backup and recovery options and the performance tuning aspects of Hadoop. Finally, you will get a better understanding of troubleshooting, diagnostics, and best practices in Hadoop administration. By the end of this book, you will have a proper understanding of working with Hadoop clusters and will also be able to secure and encrypt them and configure auditing for your Hadoop clusters.

Who is this book for?

If you are a system administrator with a basic understanding of Hadoop and you want to get into Hadoop administration, this book is for you. It's also ideal if you are a Hadoop administrator who wants a quick reference guide to all the Hadoop administration-related tasks and solutions to commonly occurring problems.

What you will learn

  • Set up the Hadoop architecture to run a Hadoop cluster smoothly
  • Maintain a Hadoop cluster on HDFS, YARN, and MapReduce
  • Understand high availability with Zookeeper and Journal Node
  • Configure Flume for data ingestion and Oozie to run various workflows
  • Tune the Hadoop cluster for optimal performance
  • Schedule jobs on a Hadoop cluster using the Fair and Capacity scheduler
  • Secure your cluster and troubleshoot it for various common pain points

Product Details

Publication date: May 26, 2017
Length: 348 pages
Edition: 1st
Language: English
ISBN-13: 9781787126732


Table of Contents

13 Chapters
1. Hadoop Architecture and Deployment
2. Maintaining Hadoop Cluster HDFS
3. Maintaining Hadoop Cluster – YARN and MapReduce
4. High Availability
5. Schedulers
6. Backup and Recovery
7. Data Ingestion and Workflow
8. Performance Tuning
9. HBase Administration
10. Cluster Planning
11. Troubleshooting, Diagnostics, and Best Practices
12. Security
Index

