Hadoop 2.x Administration Cookbook: Administer and maintain large Apache Hadoop clusters

Chapter 1. Hadoop Architecture and Deployment

In this chapter, we will cover the following recipes:

  • Overview of Hadoop Architecture
  • Building and compiling Hadoop
  • Installation methods
  • Setting up host resolution
  • Installing a single-node cluster - HDFS components
  • Installing a single-node cluster - YARN components
  • Installing a multi-node cluster
  • Configuring the Hadoop Gateway node
  • Decommissioning nodes
  • Adding nodes to the cluster

Introduction

As Hadoop is a distributed system with many components and has a reputation for being quite complex, it is important to understand the basic architecture before we start with the deployments.

In this chapter, we will take a look at the architecture and at recipes to deploy a Hadoop cluster in various modes. This chapter will also cover recipes for commissioning and decommissioning nodes in a cluster.

The recipes in this chapter will primarily focus on deploying a cluster based on an Apache Hadoop distribution, as it is the best way to learn and explore Hadoop.

Note

While the recipes in this chapter will give you an overview of a typical configuration, we encourage you to adapt this design according to your needs. The deployment directory structure varies according to IT policies within an organization. All our deployments will be based on the Linux operating system, as it is the most commonly used platform for Hadoop in production. You can use any flavor of Linux; the recipes are very generic in nature and should work on all Linux flavors, with the appropriate changes in path and installation methods, such as yum or apt-get.

Overview of Hadoop Architecture

Hadoop is a framework and not a tool. It is a combination of various components, such as a filesystem, processing engine, data ingestion tools, databases, workflow execution tools, and so on. Hadoop is based on a client-server architecture, with a master node for each of the storage and processing layers.

Namenode is the master for Hadoop Distributed File System (HDFS) storage and ResourceManager is the master for YARN (Yet Another Resource Negotiator) processing. The Namenode stores the file metadata, while the actual blocks/data reside on the slave nodes called Datanodes. All jobs are submitted to the ResourceManager, which then assigns tasks to its slaves, called NodeManagers. In a highly available cluster, we can have more than one Namenode and ResourceManager.

Each of these masters is a single point of failure, which makes them very critical components of the cluster, so care must be taken to make them highly available.

Although there are many concepts to learn, such as application masters, containers, schedulers, and so on, as this is a recipe book, we will keep the theory to a minimum.

Building and compiling Hadoop

The pre-built Hadoop binary available at www.apache.org is a 32-bit version and is not suitable for 64-bit hardware, as it will not be able to utilize the entire addressable memory. Although we can use the 32-bit version for lab purposes, it will keep printing warnings about the native library not being built for the platform, which can be safely ignored.

In production, we will always be running Hadoop on 64-bit hardware that can support larger amounts of memory. To properly utilize more than 4 GB of memory on any node, we need the 64-bit compiled version of Hadoop.

Getting ready

To step through the recipes in this chapter, or indeed the entire book, you will need at least one preinstalled Linux instance. You can use any distribution of Linux, such as Ubuntu, CentOS, or any other flavor you are comfortable with. The recipes are very generic and are expected to work with all distributions, although, as stated before, you may need to use distro-specific commands. For example, for package installation on CentOS we use the yum package installer, whereas on Debian-based systems we use apt-get, and so on. You are expected to know basic Linux commands and should know how to set up package repositories such as a yum repository. You should also know how DNS resolution is configured. No other prerequisites are required.

How to do it...

  1. ssh to the Linux instance using any ssh client. If you are on Windows, you can use PuTTY; on a Mac or Linux, the default terminal provides ssh. The following command connects to the host with an IP of 10.0.0.4. Change it to whatever the IP is in your case:
    $ ssh root@10.0.0.4
    
  2. Change to the user root or any other privileged user:
    $ sudo su -
    
  3. Install the dependencies needed to build Hadoop. JDK 1.7 (update 45 or later) must also be installed:
    # yum install gcc gcc-c++ openssl-devel make cmake
    
  4. Download Maven:
    # wget http://mirrors.gigenet.com/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
    
  5. Untar Maven:
    # tar -zxf apache-maven-3.3.9-bin.tar.gz -C /opt/
    
  6. Set up the Maven environment:
    # cat /etc/profile.d/maven.sh
    export JAVA_HOME=/usr/java/latest
    export M3_HOME=/opt/apache-maven-3.3.9
    export PATH=$JAVA_HOME/bin:/opt/apache-maven-3.3.9/bin:$PATH
    
  7. Download and set up protobuf:
    # wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
    # tar -xzf protobuf-2.5.0.tar.gz -C /opt/
    # cd /opt/protobuf-2.5.0/
    # ./configure
    # make && make install
    
  8. Download the latest Hadoop stable source code. At the time of writing, the latest Hadoop version is 2.7.3:
    # wget http://apache.uberglobalmirror.com/hadoop/common/stable2/hadoop-2.7.3-src.tar.gz
    # tar -xzf hadoop-2.7.3-src.tar.gz -C /opt/
    # cd /opt/hadoop-2.7.3-src
    # mvn package -Pdist,native -DskipTests -Dtar
    
  9. You will see a tarball in the folder hadoop-2.7.3-src/hadoop-dist/target/.
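Once the build finishes, it is worth confirming that the libraries really were compiled as 64-bit. A minimal check, assuming the freshly built tarball is extracted under /opt (paths are illustrative):

    # tar -xzf /opt/hadoop-2.7.3-src/hadoop-dist/target/hadoop-2.7.3.tar.gz -C /opt/
    # file /opt/hadoop-2.7.3/lib/native/libhadoop.so.1.0.0

The file command should report an ELF 64-bit shared object; once the environment is set up later in this chapter, hadoop checknative -a gives a more detailed report of the native libraries in use.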

How it works...

The tarball package created will be used for the installation of Hadoop throughout the book. It is not mandatory to build Hadoop from source, but by default the binary packages provided by Apache Hadoop are 32-bit versions. For production, it is important to use a 64-bit version so as to fully utilize memory beyond 4 GB and to gain other performance benefits.

Installation methods

Hadoop can be installed in multiple ways, either by using repository methods such as yum/apt-get or by extracting the tarball packages. The Apache Bigtop project (http://bigtop.apache.org/) provides Hadoop infrastructure packages, which can be used by creating a local repository of the packages.

All the steps are to be performed as the root user. It is expected that the user knows how to set up a yum repository and Linux basics.

Getting ready

You are going to need a Linux machine. You can either use the one that was used in the previous recipe or set up a new node, which will act as a repository server and host all the packages we need.

How to do it...

  1. Connect to a Linux machine that has at least 5 GB disk space to store the packages.
  2. If you are on CentOS or a similar distribution, make sure you have the package yum-utils installed. This package will provide the command reposync.
  3. Create a file bigtop.repo under /etc/yum.repos.d/. Note that the file name can be anything—only the extension must be .repo.
  4. Populate the file with the Bigtop repository definition; a sketch of typical contents is shown after this list.
  5. Execute the command reposync -r bigtop. It will create a directory named bigtop under the present working directory, with all the packages downloaded to it.
  6. All the required Hadoop packages can be installed by configuring the repository we downloaded as a repository server.
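The contents of the bigtop.repo file are shown only as a screenshot in the original, so the following is a rough sketch of what such a repository definition typically looks like. The baseurl is an assumption and must be replaced with the Bigtop repository URL that matches your distribution, architecture, and Bigtop version:

    # cat /etc/yum.repos.d/bigtop.repo
    [bigtop]
    name=Apache Bigtop
    baseurl=http://archive.apache.org/dist/bigtop/bigtop-1.1.0/repos/centos6/x86_64/
    gpgcheck=0
    enabled=1

The repository id in brackets, bigtop, is the name passed to reposync -r in step 5.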

How it works...

From step 2 to step 6, the user configures and uses the Hadoop package repository. Setting up a yum repository is not required, but it makes things easier if we have to perform installations on hundreds of nodes. In larger setups, configuration management systems such as Puppet or Chef are used to push configuration and packages to the nodes.

In this chapter, we will be using the tarball package that was built in the first section to perform installations. This is the best way of learning about directory structure and the configurations needed.

Setting up host resolution

Before we start with the installations, it is important to make sure that the host resolution is configured and working properly.

Getting ready

Choose appropriate hostnames for the Linux machines, for example, master1.cluster.com, rt1.cyrus.com, or host1.example.com. The important thing is that the hostnames must resolve.

This resolution can be done using a DNS server or by configuring the /etc/hosts file on each node we use for our cluster setup.

The following steps will show you how to set up the resolution in the /etc/hosts file.

How to do it...

  1. Connect to the Linux machine and change the hostname to master1.cyrus.com in the appropriate configuration file for your distribution; a sketch is shown after this list.
  2. Edit the /etc/hosts file so that the hostname maps to the node's IP address, as sketched after this list.
  3. Make sure the resolution returns an IP address:
    # getent hosts master1.cyrus.com
    
  4. The other, preferred method is to set up DNS resolution so that we do not have to populate the hosts file on each node. In the example resolution shown here, you can see that the DNS server is configured to answer queries for the domain cyrus.com:
    # nslookup master1.cyrus.com
    Server:		10.0.0.2
    Address:	10.0.0.2#53
    
    Non-authoritative answer:
    Name:	master1.cyrus.com
    Address: 10.0.0.104
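The hostname and hosts file screenshots from steps 1 and 2 are not reproduced here. A minimal sketch, assuming a CentOS 6 machine (on systemd-based distributions, hostnamectl set-hostname master1.cyrus.com serves the same purpose) and the example IP address used above:

    # cat /etc/sysconfig/network
    NETWORKING=yes
    HOSTNAME=master1.cyrus.com

    # cat /etc/hosts
    127.0.0.1   localhost
    10.0.0.104  master1.cyrus.com master1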
    

How it works...

Each Linux host has a resolver library that helps it resolve any hostname that is asked for. It consults the hosts file and the DNS server, in the order defined in /etc/nsswitch.conf. Users who are not Linux administrators can simply use the hosts file as a workaround to set up a Hadoop cluster. There are many resources available online that can help you set up a DNS server quickly if needed.

Once the resolution is in place, we will start with the installation of Hadoop on a single node and then progress to multiple nodes.

Installing a single-node cluster - HDFS components

Usually the term cluster means a group of machines, but in this recipe we will be installing the various Hadoop daemons on a single node. This single machine will act as both the master and slave for the storage and processing layers.

Getting ready

You will need some information before stepping through this recipe.

Although Hadoop can be configured to run as the root user, it is good practice to run it as a non-privileged user. In this recipe, we are using the node nn1.cluster1.com, preinstalled with CentOS 6.5.

Tip

Create a system user named hadoop and set a password for that user.

Install the JDK, which will be used by the Hadoop services. The minimum recommended version is JDK 1.7; OpenJDK can also be used.

How to do it...

  1. Log in to the machine/host as the root user and install the JDK:
    # yum install jdk -y
    Alternatively, it can be installed from an RPM package:
    # rpm -ivh jdk-1.7u45.rpm
    
  2. Once Java is installed, make sure it is in the PATH for execution. This can be done by setting JAVA_HOME and exporting it as an environment variable:
    # export JAVA_HOME=/usr/java/latest
  3. Now we need to copy the tarball hadoop-2.7.3.tar.gz, which was built in the Building and compiling Hadoop recipe earlier in this chapter, to the home directory of the root user. For this, log in to the node where Hadoop was built and execute the following command:
    # scp hadoop-2.7.3.tar.gz root@nn1.cluster1.com:~/
    
  4. Create a directory named /opt/cluster to be used for Hadoop:
    # mkdir -p /opt/cluster
    
  5. Then untar hadoop-2.7.3.tar.gz into the directory created in the preceding step:
    # tar -xzvf hadoop-2.7.3.tar.gz -C /opt/cluster/
    
  6. Create a user named hadoop, if you haven't already, and set the password as hadoop:
    # useradd hadoop
    # echo hadoop | passwd --stdin hadoop
    
  7. As the preceding steps were performed by the root user, the directories and files under /opt/cluster will be owned by the root user. Change the ownership to the hadoop user:
    # chown -R hadoop:hadoop /opt/cluster/
    
  8. If you list the directory structure under /opt/cluster, you will see the extracted hadoop-2.7.3 directory.
  9. Take a look at the directory structure under /opt/cluster/hadoop-2.7.3.
  10. The listing shows etc, bin, sbin, and other directories.
  11. The etc/hadoop directory is the one that contains the configuration files for the various Hadoop daemons. Some of the key files are core-site.xml, hdfs-site.xml, hadoop-env.sh, and mapred-site.xml, among others, which will be explained in later sections.
  12. The directories bin and sbin contain executable binaries, which are used to start and stop Hadoop daemons and to perform other operations such as filesystem listing, copying, deleting, and so on.
  13. To execute a command such as /opt/cluster/hadoop-2.7.3/bin/hadoop, the complete path to the command needs to be specified. This is cumbersome, and can be avoided by setting the HADOOP_HOME environment variable and adding the bin directory to the PATH.
  14. Similarly, there are other variables that need to be set that point to the binaries and the configuration file locations; these all go into the file /etc/profile.d/hadoopenv.sh (a sketch is shown after this list).
  15. The environment file is set up system-wide so that any user can use the commands. Once the hadoopenv.sh file is in place, source it with . /etc/profile.d/hadoopenv.sh so that the variables defined in it are exported.
  16. Change to the hadoop user using the command su - hadoop.
  17. Change to the /opt/cluster directory and create a symlink with ln -s hadoop-2.7.3 hadoop.
  18. To verify that the preceding changes are in place, execute either the which hadoop or which java command, or execute the command hadoop directly without specifying the complete path.
  19. In addition to setting the environment as discussed, the user has to add the JAVA_HOME variable in the hadoop-env.sh file.
  20. The next thing is to set up the Namenode address, which specifies the host:port on which it will listen. This is done in the file core-site.xml (see the sketch after this list).
  21. The important thing to keep in mind is the property fs.defaultFS.
  22. The next thing to configure is the location where the Namenode will store its metadata. This can be any location, but it is recommended that you always have a dedicated disk for it. It is configured with the dfs.namenode.name.dir property in the file hdfs-site.xml.
  23. The next step is to format the Namenode. This will create an HDFS file system:
    $ hdfs namenode -format
    
  24. Similarly, we have to add the property for the Datanode data directory, dfs.datanode.data.dir, under hdfs-site.xml. Nothing needs to be changed in the core-site.xml file.
  25. Then the services need to be started for Namenode and Datanode:
    $ hadoop-daemon.sh start namenode
    $ hadoop-daemon.sh start datanode
    
  26. The command jps can be used to check that the Namenode and Datanode daemons are running.
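The environment file and the two configuration files are shown as screenshots in the original, so the following is a minimal sketch of what they might contain for this single-node setup. The directory paths, the nn1.cluster1.com hostname, and the port 9000 are assumptions; adapt them to your environment:

    # cat /etc/profile.d/hadoopenv.sh
    export JAVA_HOME=/usr/java/latest
    export HADOOP_HOME=/opt/cluster/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export YARN_HOME=$HADOOP_HOME
    export YARN_CONF_DIR=$HADOOP_CONF_DIR
    export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

    # cat /opt/cluster/hadoop/etc/hadoop/core-site.xml
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://nn1.cluster1.com:9000</value>
      </property>
    </configuration>

    # cat /opt/cluster/hadoop/etc/hadoop/hdfs-site.xml
    <configuration>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/space/dfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>/space/dfs/datanode</value>
      </property>
    </configuration>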

How it works...

The master Namenode stores the metadata, and the slave Datanodes store the blocks. When the Namenode is formatted, it creates a directory structure that contains the fsimage, edits, and VERSION files. These are very important for the functioning of the cluster.

The parameters dfs.data.dir and dfs.datanode.data.dir serve the same purpose, but belong to different versions. The older parameters are deprecated in favor of the newer ones, but they will still work. Similarly, the parameter dfs.name.dir has been deprecated in favor of dfs.namenode.name.dir in Hadoop 2.x. The intention of showing both versions of the parameters is to bring to your notice that parameters are evolving and ever changing, and care must be taken by referring to the release notes for each Hadoop version.

There's more...

Setting up ResourceManager and NodeManager

In the preceding recipe, we set up the storage layer, that is, HDFS for storing data, but what about the processing layer? The data on HDFS needs to be processed to make meaningful decisions using MapReduce, Tez, Spark, or any other tool. To run MapReduce, Spark, or another processing framework, we need ResourceManager and NodeManager, which are covered in the next recipe.

Installing a single-node cluster - YARN components

In the previous recipe, we discussed how to set up Namenode and Datanode for HDFS. In this recipe, we will be covering how to set up YARN on the same node.

After completing this recipe, there will be four daemons running on the nn1.cluster1.com node, namely the Namenode, Datanode, ResourceManager, and NodeManager daemons.

Getting ready

For this recipe, you will again use the same node on which we have already configured the HDFS layer.

All operations will be done by the hadoop user.

How to do it...

  1. Log in to the node nn1.cluster1.com and change to the hadoop user.
  2. Change to the /opt/cluster/hadoop/etc/hadoop directory and configure the files mapred-site.xml and yarn-site.xml (a sketch is shown after this list).
  3. The file yarn-site.xml specifies the shuffle class, scheduler, and resource management components of the ResourceManager. You only need to specify yarn.resourcemanager.address; the rest are picked up automatically. If required, the ResourceManager's components, such as the scheduler and resource tracker, can each be given their own address.
  4. Once the configurations are in place, the resourcemanager and nodemanager daemons need to be started:
    $ yarn-daemon.sh start resourcemanager
    $ yarn-daemon.sh start nodemanager
  5. The environment variables that were defined by /etc/profile.d/hadoopenv.sh included YARN_HOME and YARN_CONF_DIR, which let the framework know about the location of the YARN configurations.
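The mapred-site.xml and yarn-site.xml screenshots are not reproduced here, so the following is a rough sketch of a minimal configuration for this single-node setup. The hostname and the port 8032 are assumptions:

    # cat /opt/cluster/hadoop/etc/hadoop/mapred-site.xml
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>

    # cat /opt/cluster/hadoop/etc/hadoop/yarn-site.xml
    <configuration>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>nn1.cluster1.com:8032</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>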

How it works...

The nn1.cluster1.com node is configured to run the HDFS and YARN components. Any file that is copied to HDFS will be split into blocks and stored on the Datanode. The metadata of the file will be kept on the Namenode.

Any operation performed on a text file, such as word count, can be done by running a simple MapReduce program, which will be submitted to the single node cluster using the ResourceManager daemon and executed by the NodeManager. There are a lot of steps and details as to what goes on under the hood, which will be covered in the coming chapters.

Note

The single-node cluster is also called a pseudo-distributed cluster.

There's more...

A quick check can be done on the functionality of HDFS. You can create a simple text file and upload it to HDFS to see whether it is successful or not:

$ hadoop fs -put test.txt /

This will copy the file test.txt to the HDFS. The file can be read directly from HDFS:

$ hadoop fs -ls /
$ hadoop fs -cat /test.txt
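To exercise the YARN layer as well, the same file can be run through the word count example that ships with the distribution. This is a sketch; the examples jar file name depends on the exact Hadoop version, and the output directory is arbitrary:

$ yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /test.txt /wcout
$ hadoop fs -cat /wcout/part-r-00000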

See also

  • The Installing a multi-node cluster recipe

Installing a multi-node cluster

In the previous recipes, we looked at how to configure a single-node Hadoop cluster, also referred to as a pseudo-distributed cluster. In this recipe, we will set up a fully distributed cluster, with each daemon running on separate nodes.

There will be one node for Namenode, one for ResourceManager, and four nodes will be used for Datanode and NodeManager. In production, the number of Datanodes could be in the thousands, but here we are just restricted to four nodes. The Datanode and NodeManager coexist on the same nodes for the purposes of data locality and locality of reference.

Getting ready

Make sure that the six nodes the user chooses have JDK installed, with name resolution working. This could be done by making entries in the /etc/hosts file or using DNS.

In this recipe, we are using the following nodes:

  • Namenode: nn1.cluster1.com
  • ResourceManager: jt1.cluster1.com
  • Datanodes and NodeManager: dn[1-4].cluster1.com

How to do it...

  1. Make sure all the nodes have the hadoop user.
  2. Create the directory structure /opt/cluster on all the nodes.
  3. Make sure the ownership is correct for /opt/cluster.
  4. Copy the /opt/cluster/hadoop-2.7.3 directory from nn1.cluster1.com to all the nodes in the cluster:
    $ for i in 192.168.1.{72..75}; do scp -r hadoop-2.7.3 $i:/opt/cluster/; done
    
  5. The preceding IPs belong to the nodes in the cluster; modify them accordingly. Also, to prevent scp from prompting for a password for each node, it is good to set up passphraseless SSH access between the nodes.
  6. Change to the directory /opt/cluster and create a symbolic link on each node:
    $ ln -s hadoop-2.7.3 hadoop
    
  7. Make sure that the environment variables have been set up on all nodes:
    $ . /etc/profile.d/hadoopenv.sh
    
  8. On Namenode, only the parameters specific to it are needed.
  9. The file core-site.xml remains the same across all nodes in the cluster.
  10. On the Namenode, the file hdfs-site.xml changes to carry only the Namenode-specific properties; a sketch of the per-role files is given after this list.
  11. On the Datanodes, the file hdfs-site.xml changes to carry the Datanode-specific properties.
  12. On the Datanodes, the file yarn-site.xml points the NodeManagers at the ResourceManager.
  13. On the node jt1, which runs the ResourceManager, the file yarn-site.xml carries the ResourceManager settings.
  14. To start Namenode on nn1.cluster1.com, enter the following:
    $ hadoop-daemon.sh start namenode
    
  15. To start Datanode and NodeManager on dn[1-4], enter the following:
    $ hadoop-daemon.sh start datanode
    $ yarn-daemon.sh start nodemanager
    
  16. To start ResourceManager on jt1.cluster1.com, enter the following:
    $ yarn-daemon.sh start resourcemanager
    
  17. On each node, execute the command jps to see the daemons running on them. Make sure you have the correct services running on each node.
  18. Create a text file test.txt and copy it to HDFS using hadoop fs -put test.txt /. This confirms that HDFS is working fine.
  19. To verify that YARN has been set up correctly, run the simple pi estimation program from the examples jar (the jar file name depends on the Hadoop version):
    $ yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 3 3
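The per-role configuration files from steps 10 to 13 are shown as screenshots in the original. The following sketch captures the gist of them; the /space/dfs directory paths are assumptions, while the jt1.cluster1.com hostname follows the node naming used in this recipe:

    <!-- On nn1 (Namenode): hdfs-site.xml carries the metadata directory -->
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/space/dfs/namenode</value>
    </property>

    <!-- On dn1 to dn4 (Datanodes): hdfs-site.xml carries the block storage directory instead -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/space/dfs/datanode</value>
    </property>

    <!-- On jt1 and on the Datanodes: yarn-site.xml points every node at the ResourceManager -->
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>jt1.cluster1.com</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>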
    

How it works...

Steps 1 through 7 copy the already extracted and configured Hadoop files to other nodes in the cluster. From step 8 onwards, each node is configured according to the role it plays in the cluster.

The user should see four Datanodes reporting to the cluster, and should also be able to access the Namenode web UI on port 50070 and the ResourceManager web UI on port 8088.

To see the number of nodes talking to Namenode, enter the following:

$ hdfs dfsadmin -report
  Configured Capacity: 9124708352 (21.50 GB)
  Present Capacity: 5923942400 (20.52 GB)
  DFS Remaining: 5923938304 (20.52 GB)
  DFS Used: 4096 (4 KB)
  DFS Used%: 0.00%
Live datanodes (4):

The same information can also be retrieved using the Namenode web UI.

Note

The user can configure any custom port for any service, but there should be a good reason to change the defaults.

Configuring the Hadoop Gateway node

A Hadoop Gateway, or edge node, is a node that connects to the Hadoop cluster but does not run any of the daemons. The purpose of an edge node is to provide an access point to the cluster and prevent users from connecting directly to critical components such as the Namenode or Datanodes.

Another important reason for its use is data distribution across the cluster. If a user connects to a Datanode and performs the copy operation hadoop fs -put file /, then one replica of every block will always be placed on the Datanode from which the copy command was executed. This results in an imbalance of data across the nodes. If we upload a file from a node that is not a Datanode, the data will be distributed evenly across the cluster.
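If data does become unevenly distributed, the HDFS balancer can be run to redistribute blocks across the Datanodes. A typical invocation, where the threshold is the acceptable percentage deviation from the average utilization (the value 10 is illustrative):

$ hdfs balancer -threshold 10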

In this recipe, we will configure an edge node for a Hadoop cluster.

Getting ready

For the edge node, the user needs a separate Linux machine with Java installed and the user hadoop in place.

How to do it...

  1. ssh to the new node that is to be configured as Gateway node. For example, the node name could be client1.cluster1.com.
  2. Set up the environment variables as discussed before. This can be done by setting up the /etc/profile.d/hadoopenv.sh file.
  3. Copy the already configured directory hadoop-2.7.3 from Namenode to this node (client1.cluster1.com). This avoids doing all the configuration for files such as core-site.xml and yarn-site.xml.
  4. The edge node just needs to know about the two master nodes, the Namenode and the ResourceManager. It does not need any other configuration for the time being, and it does not store any data locally, unlike the Namenode and Datanodes. A sketch of the minimal client configuration is shown after this list.
  5. It only needs to write temporary files and logs. In later chapters, we will see other parameters for MapReduce and performance tuning that go on this node.
  6. Create a symbolic link with ln -s hadoop-2.7.3 hadoop so that the commands and Hadoop configuration files are visible.
  7. There will be no daemons started on this node. Execute a command such as hadoop fs -ls / from the edge node to make sure it can connect to the cluster.
  8. To verify that the edge node has been set up correctly, run the simple pi estimation program from the edge node (the examples jar file name depends on the Hadoop version):
    $ yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 3 3
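As a minimal sketch of the client-side configuration mentioned in step 4, the edge node's core-site.xml and yarn-site.xml only need to point at the two masters. The hostnames and the port shown are the ones assumed in the earlier recipes:

    # cat /opt/cluster/hadoop/etc/hadoop/core-site.xml
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://nn1.cluster1.com:9000</value>
      </property>
    </configuration>

    # cat /opt/cluster/hadoop/etc/hadoop/yarn-site.xml
    <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>jt1.cluster1.com</value>
      </property>
    </configuration>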
    

How it works...

The edge node, or Gateway node, connects to the Namenode for all HDFS-related operations and to the ResourceManager for submitting jobs to the cluster.

In production, there will be more than one edge node connecting to the cluster for high availability. This can be done by using a load balancer or DNS round-robin. No user should run local jobs on the edge nodes or use them for non-Hadoop-related tasks.

See also

The edge node can also be used to host many additional components, such as Pig, Hive, and Sqoop, rather than installing them on the main cluster nodes such as the Namenode and Datanodes. This way, it is easy to segregate the complexity and restrict access to just the edge node.

  • The Configuring Hive recipe

Decommissioning nodes

There will always be failures in clusters, such as hardware issues or the need to upgrade nodes. Removing a node should be done in a graceful manner, without any data loss.

When the Datanode daemon is stopped on a node, it takes approximately ten minutes for the Namenode to mark that node as dead. This is governed by the heartbeat and recheck intervals. We can abruptly remove a Datanode at any time, but that can result in data loss.
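As a rough sketch of where the ten minutes come from, assuming the stock Hadoop 2.x defaults for the two hdfs-site.xml properties involved:

    dfs.heartbeat.interval                  = 3 seconds (default)
    dfs.namenode.heartbeat.recheck-interval = 300000 milliseconds (default, 5 minutes)

    dead-node timeout = 2 * recheck-interval + 10 * heartbeat-interval
                      = 2 * 300 s + 10 * 3 s = 630 s (roughly 10.5 minutes)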

It is recommended that you opt for the graceful removal of the node from the cluster, as this ensures that all the data on that node is drained.

Getting ready

For the following steps, we assume that the cluster is up and running with its Datanodes in a healthy state, and that the Datanode dn1.cluster1.com needs maintenance and must be removed from the cluster. We will log in to the Namenode and make the changes there.

How to do it...

  1. ssh to Namenode and edit the file hdfs-site.xml by adding the following property to it:
    <property>
        <name>dfs.hosts.exclude</name>
        <value>/home/hadoop/excludes</value>
        <final>true</final>
    </property>
  2. Make sure the file excludes is readable by the user hadoop.
  3. Restart the Namenode daemon for the property to take effect:
    $ hadoop-daemon.sh stop namenode
    $ hadoop-daemon.sh start namenode
    
  4. A restart of the Namenode is required only when a property is changed in hdfs-site.xml. Once the property is in place, the Namenode can pick up changes to the contents of the excludes file by simply refreshing the node list.
  5. Add the dn1.cluster1.com node to the file excludes:
    $ cat excludes
    dn1.cluster1.com
    
  6. After adding the node to the file, we just need to reload the file by doing the following:
    $ hadoop dfsadmin -refreshNodes
    
  7. After some time, the node will be decommissioned. The time will vary according to the amount of data the particular Datanode held. We can see the decommissioned nodes using the following:
    $ hdfs dfsadmin -report
    
  8. The preceding command will list the nodes in the cluster, and against the dn1.cluster1.com node we can see that its status will either be decommissioning or decommissioned.

How it works...

Let's have a look at what we did throughout this recipe.

In steps 1 through 6, we added the new property to the hdfs-site.xml file and then restarted the Namenode to make it aware of the change. Once the property is in place, the Namenode knows about the excludes file, and it can be asked to re-read the file by simply refreshing the node list, as done in step 6.

With these steps, the data on the Datanode dn1.cluster1.com will be moved to other nodes in the cluster, and once the data has been drained, the Datanode daemon on the node will be shut down. During the process, the node's status will change from normal to decommissioning and then to decommissioned.

Care must be taken while decommissioning nodes in the cluster. The user should not decommission multiple nodes at a time, as this will generate a lot of network traffic and can cause congestion and data loss.

See also

  • The Adding nodes to the cluster recipe

Adding nodes to the cluster

Over a period of time, our cluster will grow in data and there will be a need to increase the capacity of the cluster by adding more nodes.

We can add Datanodes to the cluster in the same way that we first configured a Datanode and started the Datanode daemon on it. The important thing to keep in mind is that not every node should be able to become part of the cluster. It should not be possible for anyone to simply start a Datanode daemon on, say, a laptop and join the cluster, as that would be disastrous. By default, there is nothing preventing any node from becoming a Datanode, as a user merely has to untar the Hadoop package, point the core-site.xml file at the Namenode, and start the Datanode daemon.

Getting ready

For the following steps, we assume that the cluster is up and running with its Datanodes in a healthy state, and that we need to add a new Datanode to the cluster. We will log in to the Namenode and make the changes there.

How to do it...

  1. ssh to Namenode and edit the file hdfs-site.xml to add the following property to it:
    <property>
        <name>dfs.hosts</name>
        <value>/home/hadoop/includes</value>
        <final>true</final>
    </property>
  2. Make sure the file includes is readable by the user hadoop.
  3. Restart the Namenode daemon for the property to take effect:
    $ hadoop-daemon.sh stop namenode
    $ hadoop-daemon.sh start namenode
    
  4. A restart of the Namenode is required only when a property is changed in hdfs-site.xml. Once the property is in place, the Namenode can pick up changes to the contents of the includes file by simply refreshing the node list.
  5. Add the dn1.cluster1.com node to the file includes:
    $ cat includes
    dn1.cluster1.com
    
  6. The file includes or excludes can contain a list of multiple nodes, one node per line.
  7. After adding the node to the file, we just need to reload the file by entering the following:
    $ hadoop dfsadmin -refreshNodes
    
  8. After some time, the node will be available in the cluster and can be seen:
    $ hdfs dfsadmin -report
    

How it works...

The file /home/hadoop/includes will contain a list of all the Datanodes that are allowed to join the cluster. If the includes file is blank, then all Datanodes are allowed to join the cluster. If there are both an includes and an excludes file, the lists of nodes in the two files must be mutually exclusive. So, to decommission the node dn1.cluster1.com, it must be removed from the includes file and added to the excludes file.

There's more...

In addition to controlling which nodes can join, as described here, there will usually be firewall rules and separate VLANs in place for Hadoop clusters to keep the traffic and data isolated.


Key benefits

  • Become an expert Hadoop administrator and perform tasks to optimize your Hadoop Cluster
  • Import and export data into Hive and use Oozie to manage workflows
  • Practical recipes will help you plan and secure your Hadoop cluster, and make it highly available

Description

Hadoop enables the distributed storage and processing of large datasets across clusters of computers. Learning how to administer Hadoop is crucial to exploiting its unique features. With this book, you will be able to overcome common problems encountered in Hadoop administration. The book begins by laying the foundation, showing you the steps needed to set up a Hadoop cluster and its various nodes. You will get a better understanding of how to maintain a Hadoop cluster, especially on the HDFS layer and when using YARN and MapReduce. Further on, you will explore the durability and high availability of a Hadoop cluster. You'll get a better understanding of the schedulers in Hadoop and how to configure and use them for your tasks. You will also get hands-on experience with the backup and recovery options and the performance tuning aspects of Hadoop. Finally, you will get a better understanding of troubleshooting, diagnostics, and best practices in Hadoop administration. By the end of this book, you will have a proper understanding of working with Hadoop clusters and will also be able to secure and encrypt them and configure auditing for your Hadoop clusters.

Who is this book for?

If you are a system administrator with a basic understanding of Hadoop and you want to get into Hadoop administration, this book is for you. It's also ideal if you are a Hadoop administrator who wants a quick reference guide to all the Hadoop administration-related tasks and solutions to commonly occurring problems.

What you will learn

  • Set up the Hadoop architecture to run a Hadoop cluster smoothly
  • Maintain a Hadoop cluster on HDFS, YARN, and MapReduce
  • Understand high availability with Zookeeper and Journal Node
  • Configure Flume for data ingestion and Oozie to run various workflows
  • Tune the Hadoop cluster for optimal performance
  • Schedule jobs on a Hadoop cluster using the Fair and Capacity scheduler
  • Secure your cluster and troubleshoot it for various common pain points

Product Details

Publication date: May 26, 2017
Length: 348 pages
Edition: 1st
Language: English
ISBN-13: 9781787126732


Table of Contents

13 Chapters
1. Hadoop Architecture and Deployment
2. Maintaining Hadoop Cluster HDFS
3. Maintaining Hadoop Cluster – YARN and MapReduce
4. High Availability
5. Schedulers
6. Backup and Recovery
7. Data Ingestion and Workflow
8. Performance Tuning
9. HBase Administration
10. Cluster Planning
11. Troubleshooting, Diagnostics, and Best Practices
12. Security
Index

