Hadoop 2.x Administration Cookbook

Administer and maintain large Apache Hadoop clusters

Product type Paperback
Published in May 2017
Publisher Packt
ISBN-13 9781787126732
Length 348 pages
Edition 1st Edition
Author: Aman Singh

Installing a single-node cluster - HDFS components

The term cluster usually means a group of machines, but in this recipe we install the various Hadoop daemons on a single node. That one machine acts as both master and slave for the storage and processing layers.

Getting ready

You will need some information before stepping through this recipe.

Although Hadoop can be configured to run as root user, it is a good practice to run it as a non-privileged user. In this recipe, we are using the node name nn1.cluster1.com, preinstalled with CentOS 6.5.

Tip

Create a system user named hadoop and set a password for that user.

Install a JDK, which the Hadoop services will use. The minimum recommended JDK version is 1.7; OpenJDK can also be used.
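
As a quick sanity check before continuing, the installed Java version can be compared against the 1.7 minimum. The following is a minimal sketch and not part of the original recipe: the helper name java_major_minor is hypothetical, and it assumes the first line of the `java -version` output has the usual form, for example java version "1.7.0_45".

```shell
#!/bin/sh
# Hypothetical helper: extract "major.minor" from the first line of
# `java -version` output, e.g. 'java version "1.7.0_45"' gives 1.7
java_major_minor() {
  sed -n '1s/.*version "\([0-9]*\.[0-9]*\).*/\1/p'
}

# `java -version` prints to stderr, so redirect it into the pipe.
if command -v java >/dev/null 2>&1; then
  echo "Detected Java $(java -version 2>&1 | java_major_minor)"
else
  echo "java not found on PATH"
fi
```

If the JVM prints its version in a different format, the helper returns an empty string, so treat an empty result as "unknown" rather than "too old".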

How to do it...

  1. Log in to the host as the root user and install the JDK:
    # yum install jdk -y
    Alternatively, install it from the RPM package:
    # rpm -ivh jdk-1.7u45.rpm
    
  2. Once Java is installed, make sure it is on the PATH. Start by setting JAVA_HOME and exporting it as an environment variable:
    # export JAVA_HOME=/usr/java/latest
    
    [Screenshot: contents of the directory where Java is installed]
  3. Copy the tarball hadoop-2.7.3.tar.gz, which was built in the Build Hadoop section earlier in this chapter, to the home directory of the root user. To do this, log in to the node where Hadoop was built and run the following command:
    # scp -r hadoop-2.7.3.tar.gz root@nn1.cluster1.com:~/
    
  4. Create a directory named /opt/cluster to be used for Hadoop:
    # mkdir -p /opt/cluster
    
  5. Extract hadoop-2.7.3.tar.gz into the directory created in the previous step:
    # tar -xzvf hadoop-2.7.3.tar.gz -C /opt/cluster/
    
  6. Create a user named hadoop, if you haven't already, and set the password as hadoop:
    # useradd hadoop
    # echo hadoop | passwd --stdin hadoop
    
  7. Because the tarball was extracted by the root user, the files and directories under /opt/cluster are owned by root. Change the ownership to the hadoop user:
    # chown -R hadoop:hadoop /opt/cluster/
    
  8. Listing the directory structure under /opt/cluster now shows the extracted files:
    [Screenshot: listing of /opt/cluster]
  9. The directory structure under /opt/cluster/hadoop-2.7.3 looks like this:
    [Screenshot: listing of /opt/cluster/hadoop-2.7.3]
  10. The listing shows etc, bin, sbin, and other directories.
  11. The etc/hadoop directory contains the configuration files for the various Hadoop daemons. Key files include core-site.xml, hdfs-site.xml, hadoop-env.sh, and mapred-site.xml, which are explained in later sections:
    [Screenshot: contents of the etc/hadoop directory]
  12. The bin and sbin directories contain the executables used to start and stop Hadoop daemons and to perform other operations such as filesystem listing, copying, and deleting:
    [Screenshots: contents of the bin and sbin directories]
  13. Running a command such as /opt/cluster/hadoop-2.7.3/bin/hadoop requires the complete path to the command. This is cumbersome, and can be avoided by setting the environment variable HADOOP_HOME and adding its bin and sbin directories to PATH.
  14. Similarly, other variables need to be set that point to the binaries and the configuration file locations:
    [Screenshot: environment file defining the Hadoop variables]
  15. The environment file is set up system-wide so that any user can use the commands. Once the hadoopenv.sh file is in place, execute the command that exports the variables defined in it:
    [Screenshot: command that exports the variables]
  16. Switch to the hadoop user using the command su - hadoop:
    [Screenshot: switching to the hadoop user]
  17. Change to the /opt/cluster directory and create a symlink:
    [Screenshot: creating the symlink under /opt/cluster]
  18. To verify that the preceding changes are in place, run which hadoop or which java, or run the hadoop command directly without specifying the complete path.
  19. In addition to setting the environment as discussed, add the JAVA_HOME variable to the hadoop-env.sh file.
  20. Next, set the Namenode address, which specifies the host:port on which the Namenode listens. This is done in the file core-site.xml:
    [Screenshot: core-site.xml with the Namenode address]
  21. The key property here is fs.defaultFS, which holds the Namenode address.
  22. Next, configure the location where the Namenode stores its metadata. Any location works, but a dedicated disk is recommended. This is configured in the file hdfs-site.xml:
    [Screenshot: hdfs-site.xml with the Namenode metadata directory]
  23. The next step is to format the Namenode. This will create an HDFS file system:
    $ hdfs namenode -format
    
  24. Similarly, add the property for the Datanode data directory to hdfs-site.xml. No changes are needed in core-site.xml:
    [Screenshot: hdfs-site.xml with the Datanode data directory]
  25. Then the services need to be started for Namenode and Datanode:
    $ hadoop-daemon.sh start namenode
    $ hadoop-daemon.sh start datanode
    
  26. The command jps can be used to check for running daemons:
    [Screenshot: jps output listing the running daemons]
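
The screenshots for steps 13 to 17 are not reproduced here, so the following is only a minimal sketch of what the system-wide environment file might contain. Everything in it is an assumption rather than the book's exact content: the variable list, the symlink name /opt/cluster/hadoop, and the PROFILE_DIR variable, which defaults to a scratch directory so the sketch can be tried without root access.

```shell
#!/bin/sh
# Sketch only: the values below are assumptions, not the book's exact file.
# For a system-wide setup, run as root with PROFILE_DIR=/etc/profile.d.
PROFILE_DIR="${PROFILE_DIR:-$(mktemp -d)}"
ENV_FILE="$PROFILE_DIR/hadoopenv.sh"

cat > "$ENV_FILE" <<'EOF'
# Assumes /opt/cluster/hadoop is a symlink to /opt/cluster/hadoop-2.7.3
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/opt/cluster/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF

# Load the variables into the current shell
. "$ENV_FILE"
echo "HADOOP_HOME=$HADOOP_HOME"
```

On the real node, the symlink assumed above could be created with a command along the lines of ln -s /opt/cluster/hadoop-2.7.3 /opt/cluster/hadoop. Pointing HADOOP_HOME at a symlink rather than the versioned directory means a later upgrade only needs the link to be changed.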

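The configuration screenshots for steps 20 to 24 are also missing, so here is a hedged sketch of minimal core-site.xml and hdfs-site.xml files. The port 9000 and the directories /data/namenode and /data/datanode are assumptions chosen for illustration; only the property names and the host name nn1.cluster1.com come from the recipe. CONF_DIR is a hypothetical variable that defaults to a scratch directory, so nothing is overwritten by accident.

```shell
#!/bin/sh
# Sketch only: port and directory values are assumptions, not the book's
# exact settings. Set CONF_DIR to $HADOOP_HOME/etc/hadoop to apply them.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

# core-site.xml: fs.defaultFS holds the address the Namenode listens on
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn1.cluster1.com:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: where the Namenode keeps metadata and the Datanode keeps blocks
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/datanode</value>
  </property>
</configuration>
EOF

echo "Wrote sample configuration to $CONF_DIR"
```

Both directories must exist and be writable by the hadoop user before the Namenode is formatted and the daemons are started.
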
How it works...

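The master node (Namenode) stores the metadata and the slave node (Datanode) stores the blocks. Formatting the Namenode creates a directory structure under the configured metadata location that contains the fsimage, edits, and VERSION files. The Namenode relies on these files to rebuild the filesystem namespace at startup, so losing them means losing the cluster's metadata.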
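
To confirm that the format step produced this structure, the metadata directory can be inspected. The sketch below is illustrative only: check_name_dir is a hypothetical helper, and /data/namenode is an assumed value for the Namenode metadata location. It checks for VERSION and an fsimage file under the current subdirectory; edit log files typically appear once the Namenode has been started.

```shell
#!/bin/sh
# Hypothetical helper: report whether a directory looks like a formatted
# Namenode metadata directory (VERSION plus at least one fsimage file).
check_name_dir() {
  current="$1/current"
  if [ ! -f "$current/VERSION" ]; then
    echo "no VERSION file under $current"
    return 1
  fi
  if ! ls "$current"/fsimage_* >/dev/null 2>&1; then
    echo "no fsimage file under $current"
    return 1
  fi
  echo "$current looks like a formatted Namenode directory"
}

# /data/namenode is an assumed value for the metadata location
NAME_DIR="${NAME_DIR:-/data/namenode}"
check_name_dir "$NAME_DIR" || echo "Has 'hdfs namenode -format' been run?"
```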

The parameters dfs.data.dir and dfs.datanode.data.dir serve the same purpose in different Hadoop versions: the older name is deprecated in favor of the newer one, but it still works. Likewise, dfs.name.dir has been deprecated in favor of dfs.namenode.name.dir in Hadoop 2.x. Both names are shown here to highlight that parameters change between releases, so always check the release notes for the Hadoop version in use.
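
One way to catch the older names in an existing configuration is a simple text scan. The following sketch uses a hypothetical helper, find_deprecated_keys, and covers only the two pairs mentioned above. It matches the literal form <name>key</name>, so a file that puts whitespace inside the name tag would be missed.

```shell
#!/bin/sh
# Hypothetical helper: list deprecated property names found in a config file,
# together with their Hadoop 2.x replacements.
find_deprecated_keys() {
  file="$1"
  for pair in "dfs.name.dir:dfs.namenode.name.dir" \
              "dfs.data.dir:dfs.datanode.data.dir"; do
    old="${pair%%:*}"   # text before the first colon
    new="${pair#*:}"    # text after the first colon
    if grep -qF "<name>$old</name>" "$file"; then
      echo "$old is deprecated; use $new"
    fi
  done
}

# Demonstration on a throwaway file that mixes an old and a new name
SAMPLE="$(mktemp)"
cat > "$SAMPLE" <<'EOF'
<configuration>
  <property><name>dfs.name.dir</name><value>/data/namenode</value></property>
  <property><name>dfs.datanode.data.dir</name><value>/data/datanode</value></property>
</configuration>
EOF
find_deprecated_keys "$SAMPLE"
# prints: dfs.name.dir is deprecated; use dfs.namenode.name.dir
```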

There's more...

Setting up ResourceManager and NodeManager

In the preceding recipe, we set up the storage layer, that is, HDFS for storing data. But what about the processing layer? The data on HDFS needs to be processed with MapReduce, Tez, Spark, or another tool before it can inform meaningful decisions. To run MapReduce, Spark, or another processing framework, we need the ResourceManager and NodeManager daemons.
