You're reading from Hadoop 2.x Administration Cookbook Administer and maintain large Apache Hadoop clusters

Product type Paperback

Published in May 2017

Publisher Packt

ISBN-13 9781787126732

Length 348 pages

Edition 1st Edition

Author (1):

Aman Singh

Installing a single-node cluster - HDFS components

Usually the term cluster means a group of machines, but in this recipe, we will install the various Hadoop daemons on a single node. That single machine will act as both the master and the slave for the storage and processing layers.

Getting ready

You will need some information before stepping through this recipe.

Although Hadoop can be configured to run as the root user, it is good practice to run it as a non-privileged user. In this recipe, we use a node named nn1.cluster1.com, preinstalled with CentOS 6.5.

Tip

Create a system user named hadoop and set a password for that user.

Install the JDK, which will be used by the Hadoop services. The minimum recommended JDK version is 1.7; OpenJDK can also be used.

How to do it...

Log in to the host as the root user and install the JDK:

# yum install jdk -y

Alternatively, it can be installed from the RPM package:

# rpm -ivh jdk-1.7u45.rpm

Once Java is installed, make sure it is available for execution. Start by setting JAVA_HOME to the directory where Java is installed and exporting it as an environment variable:
```
# export JAVA_HOME=/usr/java/latest
```
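Setting JAVA_HOME on its own does not place the java binary on PATH. The following is a minimal sketch, assuming the JDK lives under /usr/java/latest as in the command above, that also prepends the JDK's bin directory to PATH:

```shell
# Assumes the JDK is installed under /usr/java/latest, as in the step above.
export JAVA_HOME=/usr/java/latest

# Prepend the JDK's bin directory so that java and jps resolve without a full path.
export PATH="$JAVA_HOME/bin:$PATH"

# Show which java binary the shell now resolves, or warn if there is none.
command -v java || echo "java not found on PATH; check JAVA_HOME" >&2
```

These exports last only for the current shell session; a persistent, system-wide setup is covered by the environment file later in this recipe.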
Now we need to copy the tarball hadoop-2.7.3.tar.gz, which was built in the Build Hadoop section earlier in this chapter, to the home directory of the root user. To do this, log in to the node where Hadoop was built and execute the following command:
```
# scp -r hadoop-2.7.3.tar.gz root@nn1.cluster1.com:~/
```
Create a directory named /opt/cluster to be used for Hadoop:
```
# mkdir -p /opt/cluster
```
Then extract hadoop-2.7.3.tar.gz into the directory created in the previous step:
```
# tar -xzvf hadoop-2.7.3.tar.gz -C /opt/cluster/
```
Create a user named hadoop, if you haven't already, and set its password to hadoop:
```
# useradd hadoop
# echo hadoop | passwd --stdin hadoop
```
Because the tarball was extracted by the root user, the directories and files under /opt/cluster are owned by root. Change the ownership to the hadoop user:
```
# chown -R hadoop:hadoop /opt/cluster/
```
Listing the contents of /opt/cluster now shows the extracted hadoop-2.7.3 directory. Under /opt/cluster/hadoop-2.7.3, the listing shows etc, bin, sbin, and other directories.
The etc/hadoop directory contains the configuration files for the various Hadoop daemons. Some of the key files are core-site.xml, hdfs-site.xml, hadoop-env.sh, and mapred-site.xml, among others, which will be explained in later sections.
The bin and sbin directories contain the executables used to start and stop the Hadoop daemons and to perform other operations such as filesystem listing, copying, and deleting.
To execute a command such as /opt/cluster/hadoop-2.7.3/bin/hadoop, the complete path to the command needs to be specified. This is cumbersome, and can be avoided by setting the environment variable HADOOP_HOME and adding its bin and sbin directories to PATH.
Similarly, other variables need to be set to point to the binaries and the configuration file locations.
The environment file is set up system-wide so that any user can use the commands. Once the hadoopenv.sh file is in place, source it to export the variables defined in it.
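The contents of the environment file are not reproduced here, so the following is only a sketch of what it might look like. The variable values, the /opt/cluster/hadoop path (which assumes a version-independent symlink named hadoop), and the /etc/profile.d install location are all assumptions. The sketch writes to a scratch file so that it can be reviewed before being installed as root:

```shell
# Sketch only: file locations and variable values below are assumptions.
env_file="${TMPDIR:-/tmp}/hadoopenv.sh"

cat > "$env_file" <<'EOF'
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/opt/cluster/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF

# After review, install it system-wide as root, for example:
#   cp "$env_file" /etc/profile.d/hadoopenv.sh

# Load the variables into the current shell.
. "$env_file"
echo "HADOOP_HOME=$HADOOP_HOME"
```

Files under /etc/profile.d are read by login shells on CentOS, which is why that location is a common choice for settings that every user should inherit.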
Change to the hadoop user using the command su - hadoop.
Change to the /opt/cluster directory and create a symlink.
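The symlink command itself is not shown in this recipe. A plausible sketch, assuming the link is named hadoop and points at the hadoop-2.7.3 directory so that HADOOP_HOME does not need to change on upgrade, is given below. The helper name link_hadoop is hypothetical:

```shell
# Hypothetical helper: point a version-independent "hadoop" link at a release directory.
link_hadoop() {
    cluster_dir="$1"    # for this recipe: /opt/cluster
    release_dir="$2"    # for this recipe: hadoop-2.7.3
    # -s: symbolic link, -f: replace an existing link, -n: treat an existing link as a file
    ( cd "$cluster_dir" && ln -sfn "$release_dir" hadoop )
}

# Run only when the layout from this recipe is present on the machine.
if [ -d /opt/cluster/hadoop-2.7.3 ]; then
    link_hadoop /opt/cluster hadoop-2.7.3
    ls -l /opt/cluster/hadoop
fi
```

With a link like this in place, upgrading means extracting the new release next to the old one and repointing the link.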
To verify that the preceding changes are in place, execute either the which hadoop or the which java command, or execute the hadoop command directly without specifying the complete path.
In addition to setting the environment as discussed, the user has to add the JAVA_HOME variable in the hadoop-env.sh file.
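One way to do this is to append an explicit export to hadoop-env.sh. The sketch below assumes the JDK path /usr/java/latest used earlier; the helper name set_java_home is hypothetical:

```shell
# Hypothetical helper: set JAVA_HOME explicitly in a hadoop-env.sh file.
set_java_home() {
    env_sh="$1"       # e.g. /opt/cluster/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
    java_home="$2"    # e.g. /usr/java/latest
    # Append an export; when the file is sourced, the last assignment wins.
    printf '\nexport JAVA_HOME=%s\n' "$java_home" >> "$env_sh"
}

# Run only when the layout from this recipe is present and writable.
conf=/opt/cluster/hadoop-2.7.3/etc/hadoop/hadoop-env.sh
if [ -w "$conf" ]; then
    set_java_home "$conf" /usr/java/latest
    grep '^export JAVA_HOME=' "$conf"
fi
```

Appending is the least invasive option, but running the helper repeatedly adds duplicate lines; editing the existing JAVA_HOME line by hand works just as well.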
The next step is to set the Namenode address, that is, the host:port on which it will listen. This is done in the file core-site.xml.
The important property to keep in mind is fs.defaultFS.
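A minimal core-site.xml might look like the one generated below. The host name comes from this recipe, but port 9000 and the scratch output directory are assumptions; the file is written to a scratch location so that it can be reviewed before being copied into etc/hadoop:

```shell
# Sketch only: port 9000 and the scratch output directory are assumptions.
out_dir="${TMPDIR:-/tmp}/hadoop-conf-sketch"
mkdir -p "$out_dir"

cat > "$out_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- URI that clients and Datanodes use to reach the Namenode -->
    <name>fs.defaultFS</name>
    <value>hdfs://nn1.cluster1.com:9000</value>
  </property>
</configuration>
EOF

# After review, copy the file into the Hadoop configuration directory, for example:
#   cp "$out_dir/core-site.xml" /opt/cluster/hadoop-2.7.3/etc/hadoop/
grep -A1 '<name>fs.defaultFS</name>' "$out_dir/core-site.xml"
```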
Next, configure the location where the Namenode will store its metadata. This can be any location, but a dedicated disk is recommended. It is configured in the file hdfs-site.xml.
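As a sketch, the metadata location can be expressed with the dfs.namenode.name.dir property. The path /data/namenode and the scratch output directory are assumptions chosen for illustration:

```shell
# Sketch only: /data/namenode and the scratch output directory are assumptions.
out_dir="${TMPDIR:-/tmp}/hadoop-conf-sketch"
mkdir -p "$out_dir"

cat > "$out_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- Local directory where the Namenode keeps fsimage and edits -->
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/namenode</value>
  </property>
</configuration>
EOF

# The directory must exist and be writable by the hadoop user, for example (as root):
#   mkdir -p /data/namenode && chown hadoop:hadoop /data/namenode
grep -A1 '<name>dfs.namenode.name.dir</name>' "$out_dir/hdfs-site.xml"
```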
The next step is to format the Namenode. This will create an HDFS file system:
```
$ hdfs namenode -format
```
Similarly, we have to add the property for the Datanode directory to hdfs-site.xml. Nothing needs to be changed in the core-site.xml file.
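One way to add the property without retyping the whole file is a small helper that inserts it before the closing configuration tag. This is a sketch: the helper name add_property and the path /data/datanode are hypothetical, and it assumes the closing tag sits on its own line:

```shell
# Hypothetical helper: add one property to a Hadoop *-site.xml file.
# Assumes </configuration> is on a line of its own and values contain no backslashes.
add_property() {
    site_xml="$1"; name="$2"; value="$3"
    tmp="$site_xml.tmp"
    awk -v n="$name" -v v="$value" '
        /<\/configuration>/ {
            print "  <property>"
            print "    <name>" n "</name>"
            print "    <value>" v "</value>"
            print "  </property>"
        }
        { print }
    ' "$site_xml" > "$tmp" && mv "$tmp" "$site_xml"
}

# Run only when the layout from this recipe is present and writable.
site=/opt/cluster/hadoop-2.7.3/etc/hadoop/hdfs-site.xml
if [ -w "$site" ]; then
    add_property "$site" dfs.datanode.data.dir file:///data/datanode
fi
```

On a single-node setup the Datanode directory should be different from the Namenode metadata directory, since the two daemons manage their own on-disk layouts.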

Then the services need to be started for Namenode and Datanode:

$ hadoop-daemon.sh start namenode
$ hadoop-daemon.sh start datanode

The jps command can be used to check which daemons are running.
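The jps tool prints one line per running JVM, consisting of a process ID followed by the main class name. A small sketch that turns this into a pass/fail check is shown below; the helper name daemons_running is hypothetical:

```shell
# Hypothetical helper: succeed only if every named daemon appears in a jps listing.
# jps prints one "<pid> <main class>" pair per line.
daemons_running() {
    listing="$1"; shift
    for daemon in "$@"; do
        # Compare the class-name column exactly, so NameNode does not match SecondaryNameNode.
        printf '%s\n' "$listing" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' || {
            echo "$daemon is not running" >&2
            return 1
        }
    done
}

# Run the check only where jps is available.
if command -v jps >/dev/null 2>&1; then
    if daemons_running "$(jps)" NameNode DataNode; then
        echo "HDFS daemons are up"
    fi
fi
```

Run the check as the hadoop user, because jps lists only the JVMs that the current user is allowed to see.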

How it works...

The master Namenode stores metadata and the slave node Datanode stores the blocks. When the Namenode is formatted, it creates a data structure that contains fsimage, edits, and VERSION. These are very important for the functioning of the cluster.
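Because formatting wipes existing metadata, it can be useful to check whether a metadata directory has already been formatted. The sketch below relies on the Hadoop 2.x layout, in which formatting creates a current subdirectory holding VERSION and an fsimage file. The helper name namenode_formatted and the path /data/namenode are hypothetical:

```shell
# Hypothetical helper: report whether a Namenode metadata directory has been formatted.
# Formatting creates <name dir>/current containing VERSION and at least one fsimage_* file.
namenode_formatted() {
    name_dir="$1"    # the directory configured for Namenode metadata
    [ -f "$name_dir/current/VERSION" ] &&
        ls "$name_dir"/current/fsimage_* >/dev/null 2>&1
}

# /data/namenode is an assumed path; substitute the value configured in hdfs-site.xml.
if namenode_formatted /data/namenode; then
    echo "Namenode directory is already formatted"
else
    echo "Namenode directory is not formatted"
fi
```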

The parameters dfs.data.dir and dfs.datanode.data.dir serve the same purpose but belong to different Hadoop versions. The older names are deprecated in favor of the newer ones, although they still work. Likewise, dfs.name.dir has been deprecated in favor of dfs.namenode.name.dir in Hadoop 2.x. Both forms are mentioned here to point out that parameter names change between releases, so always check the release notes for the Hadoop version in use.
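To spot the older names in an existing configuration, a simple text search is often enough. The following sketch covers only the two deprecated keys discussed above and assumes each name element is written on one line without extra whitespace; the helper name find_deprecated_keys is hypothetical:

```shell
# Hypothetical helper: list deprecated HDFS directory keys found in a configuration file.
# dfs.name.dir -> dfs.namenode.name.dir, dfs.data.dir -> dfs.datanode.data.dir
find_deprecated_keys() {
    site_xml="$1"
    grep -oE '<name>dfs\.(name|data)\.dir</name>' "$site_xml" |
        sed -e 's/<name>//' -e 's/<\/name>//'
}

# Run only when the layout from this recipe is present.
site=/opt/cluster/hadoop-2.7.3/etc/hadoop/hdfs-site.xml
if [ -f "$site" ]; then
    find_deprecated_keys "$site"
fi
```

On a node with Hadoop installed, hdfs getconf -confKey dfs.namenode.name.dir prints the value that is actually in effect, which is a useful cross-check after renaming a key.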

There's more...

Setting up ResourceManager and NodeManager

In the preceding recipe, we set up the storage layer, that is, HDFS for storing data. But what about the processing layer? The data on HDFS needs to be processed to support meaningful decisions, using MapReduce, Tez, Spark, or any other tool. To run MapReduce, Spark, or another processing framework, we need ResourceManager and NodeManager.