Hadoop 2.x Administration Cookbook: Administer and maintain large Apache Hadoop clusters

Product type Paperback
Published in May 2017
Publisher Packt
ISBN-13 9781787126732
Length 348 pages
Edition 1st Edition
Author: Aman Singh

Installing a multi-node cluster

In the previous recipes, we looked at how to configure a single-node Hadoop cluster, also referred to as a pseudo-distributed cluster. In this recipe, we will set up a fully distributed cluster, with the daemons running on separate nodes.

There will be one node for Namenode, one for ResourceManager, and four nodes will be used for Datanode and NodeManager. In production, the number of Datanodes could be in the thousands, but here we are just restricted to four nodes. The Datanode and NodeManager coexist on the same nodes for the purposes of data locality and locality of reference.

Getting ready

Make sure that the six nodes the user chooses have JDK installed, with name resolution working. This could be done by making entries in the /etc/hosts file or using DNS.

In this recipe, we are using the following nodes:

  • Namenode: nn1.cluster1.com
  • ResourceManager: jt1.cluster1.com
  • Datanodes and NodeManager: dn[1-4].cluster1.com
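
A quick pre-flight check can confirm both prerequisites before going further. The following is a minimal sketch, assuming the hostnames listed above, that `getent` is available, and that the nodes are reachable over SSH; the function names `cluster_nodes` and `preflight` are hypothetical.

```shell
# Hypothetical helper: print the six hostnames used in this recipe.
cluster_nodes() {
  echo nn1.cluster1.com
  echo jt1.cluster1.com
  for n in 1 2 3 4; do
    echo "dn${n}.cluster1.com"
  done
}

# Hypothetical helper: report whether each node resolves and has a JDK.
# Assumes SSH access to every node; BatchMode avoids hanging on a prompt.
preflight() {
  for host in $(cluster_nodes); do
    if getent hosts "$host" >/dev/null; then
      echo "$host: resolves"
    else
      echo "$host: NOT resolvable"
      continue
    fi
    if ssh -o BatchMode=yes "$host" 'java -version' >/dev/null 2>&1; then
      echo "$host: JDK found"
    else
      echo "$host: JDK missing or node unreachable"
    fi
  done
}
```

Running `preflight` from any node prints one or two status lines per host, which makes a missing /etc/hosts entry easy to spot.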

How to do it...

  1. Make sure all the nodes have the hadoop user.
  2. Create the directory structure /opt/cluster on all the nodes.
  3. Make sure the ownership is correct for /opt/cluster.
  4. Copy the /opt/cluster/hadoop-2.7.3 directory from nn1.cluster1.com to all the other nodes in the cluster:
    $ for i in 192.168.1.{72..75}; do scp -r /opt/cluster/hadoop-2.7.3 $i:/opt/cluster/; done
    
  5. The preceding IPs belong to the nodes in the cluster; modify them to match your environment. To avoid a password prompt for each node, set up passphrase-less SSH access between the nodes.
  6. Change to the directory /opt/cluster and create a symbolic link on each node:
    $ ln -s hadoop-2.7.3 hadoop
    
  7. Make sure that the environment variables have been set up on all nodes:
    $ . /etc/profile.d/hadoopenv.sh
    
  8. On Namenode, only the parameters specific to it are needed.
  9. The file core-site.xml remains the same across all nodes in the cluster.
  10. On Namenode, the file hdfs-site.xml changes as follows:
    [Screenshot: hdfs-site.xml on the Namenode]
  11. On Datanode, the file hdfs-site.xml changes as follows:
    [Screenshot: hdfs-site.xml on the Datanodes]
  12. On Datanodes, the file yarn-site.xml changes as follows:
    [Screenshot: yarn-site.xml on the Datanodes]
  13. On the node jt1, which is ResourceManager, the file yarn-site.xml is as follows:
    [Screenshot: yarn-site.xml on the ResourceManager]
  14. To start Namenode on nn1.cluster1.com, enter the following:
    $ hadoop-daemon.sh start namenode
    
  15. To start Datanode and NodeManager on dn[1-4], enter the following:
    $ hadoop-daemon.sh start datanode
    $ yarn-daemon.sh start nodemanager
    
  16. To start ResourceManager on jt1.cluster1.com, enter the following:
    $ yarn-daemon.sh start resourcemanager
    
  17. On each node, execute the command jps to see the daemons running on them. Make sure you have the correct services running on each node.
  18. Create a text file test.txt and copy it to HDFS using hadoop fs -put test.txt /. This confirms that HDFS is working.
  19. To verify that YARN has been set up correctly, run the simple "Pi" estimation program:
    $ yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 3 3
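
Steps 4 to 6 can also be driven by hostname, with passphrase-less SSH set up along the way. This is only a sketch, assuming it is run as the hadoop user on nn1.cluster1.com, that /opt/cluster already exists on every node, and that an RSA key with an empty passphrase is acceptable in your environment; the function names are hypothetical.

```shell
# Hypothetical helper: the nodes that receive a copy from nn1.
target_nodes() {
  echo jt1.cluster1.com
  for n in 1 2 3 4; do
    echo "dn${n}.cluster1.com"
  done
}

# Sketch of steps 4 to 6, to be run manually on nn1 as the hadoop user.
distribute_hadoop() {
  # Create a key pair with an empty passphrase, unless one already exists.
  [ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"

  for host in $(target_nodes); do
    # Installs the public key; asks for the password once per node.
    ssh-copy-id "$host"
    # Copy the extracted and configured Hadoop directory.
    scp -r /opt/cluster/hadoop-2.7.3 "$host:/opt/cluster/"
    # Create the version-independent symbolic link.
    ssh "$host" 'cd /opt/cluster && ln -s hadoop-2.7.3 hadoop'
  done
}
```

After `distribute_hadoop` completes, later `ssh` and `scp` calls from nn1 to the other nodes should no longer prompt for a password.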
    

How it works...

Steps 1 through 7 copy the already extracted and configured Hadoop files to other nodes in the cluster. From step 8 onwards, each node is configured according to the role it plays in the cluster.
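
To make the role-specific configuration in steps 10 to 13 more concrete, the sketch below writes illustrative fragments to a scratch directory. It is not the configuration from the book: the property names are standard Hadoop 2.x settings, but every path and value, the output file names, and the function name `write_role_configs` are assumptions chosen for illustration.

```shell
# Write illustrative role-specific fragments into a scratch directory.
# All paths and values are assumptions, not the book's configuration.
write_role_configs() {
  out="${1:-./conf-sketch}"
  mkdir -p "$out"

  # Namenode: where HDFS metadata is stored (hypothetical path).
  cat > "$out/hdfs-site.namenode.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/namenode</value>
  </property>
</configuration>
EOF

  # Datanode: where HDFS blocks are stored (hypothetical path).
  cat > "$out/hdfs-site.datanode.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/datanode</value>
  </property>
</configuration>
EOF

  # YARN: point at the ResourceManager and enable the shuffle service
  # that MapReduce jobs need on each NodeManager.
  cat > "$out/yarn-site.xml" <<'EOF'
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>jt1.cluster1.com</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
}
```

Calling `write_role_configs /tmp/conf-sketch` produces three files that can be compared against the real files under the Hadoop configuration directory before anything is copied into place.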

The user should see four Datanodes reporting to the cluster, and should be able to access the Namenode UI on port 50070 and the ResourceManager UI on port 8088.
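
Both web UIs can be probed from the command line as well. The following sketch assumes the default ports and the hostnames used in this recipe, and that `curl` is installed; `ui_url` and `check_ui` are hypothetical helper names.

```shell
# Hypothetical helper: print the default web UI URL for a given role.
ui_url() {
  case "$1" in
    namenode)        echo "http://nn1.cluster1.com:50070" ;;
    resourcemanager) echo "http://jt1.cluster1.com:8088" ;;
    *) echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print the HTTP status code returned by a role's UI.
check_ui() {
  url="$(ui_url "$1")" || return 1
  curl -s -o /dev/null --max-time 5 -w '%{http_code}\n' "$url"
}
```

For example, `check_ui namenode` prints the status code of the Namenode UI, or `000` if no connection could be made.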

To see the number of nodes talking to Namenode, enter the following:

$ hdfs dfsadmin -report
  Configured Capacity: 9124708352 (21.50 GB)
  Present Capacity: 5923942400 (20.52 GB)
  DFS Remaining: 5923938304 (20.52 GB)
  DFS Used: 4096 (4 KB)
  DFS Used%: 0.00%
Live datanodes (4):
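
The live Datanode count can also be pulled out of the report in a script. This is a minimal sketch that assumes the "Live datanodes (N):" line format shown above; `live_datanodes` and `expect_datanodes` are hypothetical helper names.

```shell
# Read `hdfs dfsadmin -report` output on stdin and print the live Datanode count.
live_datanodes() {
  sed -n 's/^Live datanodes (\([0-9][0-9]*\)):.*/\1/p'
}

# Hypothetical check: succeed only when the expected number of Datanodes is live.
expect_datanodes() {
  expected="$1"
  actual="$(hdfs dfsadmin -report | live_datanodes)"
  [ "$actual" = "$expected" ]
}
```

For this recipe, `expect_datanodes 4` returns success once all four Datanodes have registered with the Namenode.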

The same information can also be retrieved using the Namenode Web UI as shown in the following screenshot:

[Screenshot: Namenode Web UI showing the live Datanodes]

Note

The user can configure a custom port for any service, but there should be a good reason to change the defaults.
