Installing a multi-node Hadoop cluster
Now that we are comfortable with a single-node Hadoop installation, it's time to learn about a multi-node Hadoop installation.
Getting ready
In the previous recipe, we used a single Ubuntu machine for installation; in this recipe, we will be using three Ubuntu machines. If you are trying to install Hadoop for your own purposes and you don't have three machines to try this recipe, I would suggest that you get three AWS EC2 Ubuntu machines. I am using the t2.small type of EC2 instance. For more information on this, go to https://aws.amazon.com/ec2/.
Apart from this, I've also performed the following configurations on all the EC2 instances that I have been using:
- Create an AWS security group that allows traffic to the EC2 instances, and add all the instances to this security group (a scripted sketch follows this list).
- Change the hostname of each EC2 instance to its public DNS name, like this:
sudo hostname ec2-52-10-22-65.us-west-2.compute.amazonaws.com
- Disable firewalls for EC2 Ubuntu instances:
sudo ufw disable
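If you prefer to script the security group setup, here is a minimal AWS CLI sketch. The group name hadoop-cluster and the wide-open SSH rule are illustrative assumptions for a test cluster; tighten the CIDR ranges for anything long-lived:
# Create a security group for the cluster (the name is an example)
aws ec2 create-security-group --group-name hadoop-cluster --description "Hadoop multi-node cluster"
# Allow inbound SSH (restrict the CIDR in real deployments)
aws ec2 authorize-security-group-ingress --group-name hadoop-cluster --protocol tcp --port 22 --cidr 0.0.0.0/0
# Let cluster members talk to each other on all TCP ports
aws ec2 authorize-security-group-ingress --group-name hadoop-cluster --protocol tcp --port 0-65535 --source-group hadoop-cluster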
How to do it...
There are a lot of similarities between single-node and multi-node Hadoop installations, so instead of repeating the steps, I would suggest that you refer to earlier recipes as and when they're mentioned. So, let's start installing a multi-node Hadoop cluster:
- Install Java and Hadoop on the master and all slave nodes, as discussed in steps 1-5 of the previous recipe.
- AWS EC2 comes with ssh installed, so there's no need to install it again. To configure it, we need to perform the following steps. First, copy the PEM key with which you launched the EC2 instances to the master node. Next, execute the following commands, which add the key as an identity to the master's ssh configuration so that it can be used for passwordless logins to the slave machines:
eval `ssh-agent -s`
chmod 644 $HOME/.ssh/authorized_keys
chmod 400 <my-pem-key>.pem
ssh-add <my-pem-key>.pem
But if you are NOT using AWS EC2, you need to generate an ssh key on the master and copy it to the slave machines. Here are sample commands to do this:
ssh-keygen -t rsa -P ""
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave2
Either way, verify the setup before moving on, as shown next.
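A quick way to verify passwordless login (assuming slave1 and slave2 resolve from the master) is to run a remote command on each slave:
ssh ubuntu@slave1 hostname
ssh ubuntu@slave2 hostname
Each command should print the slave's hostname without prompting for a password or passphrase.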
- Next, we need to perform the Hadoop configurations. Most of the configuration files will be the same as they were in the case of the single-node installation, and these configurations are the same for all the nodes in the cluster. Refer to step 8 from the previous recipe for hadoop-env.sh, mapred-site.xml, and hdfs-site.xml. For core-site.xml and yarn-site.xml, we need to add some more properties, as shown here:
Edit core-site.xml and add the host and port on which you wish to install NameNode. As this is a multi-node Hadoop cluster installation, we will add the master's hostname instead of the localhost:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<master's-host-name>:9000/</value>
  </property>
</configuration>
Edit yarn-site.xml and add the following properties. As this is a multi-node installation, we also need to provide the address of the machine where ResourceManager is running:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value><master's-host-name></value>
  </property>
</configuration>
In the case of hdfs-site.xml, in the previous recipe, we set the replication factor to 1. As this is a multi-node cluster, we set it to 3. Don't forget to create the storage folders configured in hdfs-site.xml (a sample configuration follows this step). These configurations need to be made on all the machines of the cluster.
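For reference, a minimal hdfs-site.xml for this cluster might look like the following sketch. The dfs.namenode.name.dir and dfs.datanode.data.dir paths are assumptions; substitute whatever storage folders you configured in the previous recipe:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/data/datanode</value>
  </property>
</configuration>
You can then create the folders and push the finished configuration files from the master to the slaves, for example:
mkdir -p /usr/local/hadoop/data/namenode /usr/local/hadoop/data/datanode
scp /usr/local/hadoop/etc/hadoop/*.xml ubuntu@slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/*.xml ubuntu@slave2:/usr/local/hadoop/etc/hadoop/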
- Now that we are done with the configurations, execute the namenode format command on the master so that it creates the required subfolder structure:
hadoop namenode -format
- Now, we need to start specific services on specific nodes in order to start the cluster.
On the master node, execute the following:
/usr/local/hadoop/sbin/hadoop-daemon.sh start namenode
/usr/local/hadoop/sbin/hadoop-daemon.sh start secondarynamenode
/usr/local/hadoop/sbin/yarn-daemon.sh start resourcemanager
On all slave nodes, execute the following:
/usr/local/hadoop/sbin/hadoop-daemon.sh start datanode
/usr/local/hadoop/sbin/yarn-daemon.sh start nodemanager
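Before turning to the web interfaces, you can confirm that the daemons are up with jps, the JVM process listing tool that ships with the JDK:
jps
On the master, the output should include NameNode, SecondaryNameNode, and ResourceManager; on each slave, DataNode and NodeManager.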
If everything goes well, you should be able to see the cluster running properly. You can also check out the web interfaces for NameNode and ResourceManager. For NameNode, go to http://<master-ip-hostname>:50070/.
For ResourceManager, go to http://<master-ip-hostname>:8088/.
How it works...
Refer to the How it works section from the previous recipe.