Hadoop Real-World Solutions Cookbook- Second Edition

You're reading from   Hadoop Real-World Solutions Cookbook- Second Edition Over 90 hands-on recipes to help you learn and master the intricacies of Apache Hadoop 2.X, YARN, Hive, Pig, Oozie, Flume, Sqoop, Apache Spark, and Mahout

Product type Paperback
Published in Mar 2016
Publisher
ISBN-13 9781784395506
Length 290 pages
Edition 2nd Edition
Author (1):
Tanmay Deshpande
Table of Contents (12 chapters)

Preface
1. Getting Started with Hadoop 2.X
2. Exploring HDFS
3. Mastering Map Reduce Programs
4. Data Analysis Using Hive, Pig, and Hbase
5. Advanced Data Analysis Using Hive
6. Data Import/Export Using Sqoop and Flume
7. Automation of Hadoop Tasks Using Oozie
8. Machine Learning and Predictive Analytics Using Mahout and R
9. Integration with Apache Spark
10. Hadoop Use Cases
Index

Installing a multi-node Hadoop cluster

Now that we are comfortable with a single-node Hadoop installation, it's time to learn about a multi-node Hadoop installation.

Getting ready

In the previous recipe, we used a single Ubuntu machine for installation; in this recipe, we will be using three Ubuntu machines. If you are an individual trying to install Hadoop for your own purposes and you don't have three machines to try this recipe, I would suggest that you get three AWS EC2 Ubuntu machines. I am using the t2.small type of EC2 instances. For more information on this, go to https://aws.amazon.com/ec2/.

Apart from this, I've also performed the following configurations on all the EC2 instances that I have been using:

  1. Create an AWS security group that allows traffic to the EC2 instances, and add the instances to this security group.
  2. Change the hostname of each EC2 instance to its public hostname, like this:
    sudo hostname ec2-52-10-22-65.us-west-2.compute.amazonaws.com
    
  3. Disable the firewall on each EC2 Ubuntu instance:
    sudo ufw disable
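
Before moving on, it can help to confirm on each node that the preparation above took effect. The following is a minimal sketch and not part of the original recipe: `check_hostname` and `firewall_is_disabled` are hypothetical helper names, and the second helper assumes that `ufw status` reports `Status: inactive` once the firewall is off.

```shell
#!/bin/sh
# Hypothetical helpers for verifying the "Getting ready" steps on one node.

# Succeeds if the node's current hostname equals the expected one.
# `uname -n` prints the same node name that `hostname` reports.
check_hostname() {
    expected=$1
    actual=$(uname -n)
    if [ "$actual" = "$expected" ]; then
        echo "hostname ok: $actual"
    else
        echo "hostname mismatch: expected $expected, got $actual" >&2
        return 1
    fi
}

# Succeeds if ufw reports that it is inactive (needs sudo, like the
# `sudo ufw disable` command above).
firewall_is_disabled() {
    sudo ufw status | grep -q '^Status: inactive'
}
```

On a node, you could then run, for example, `check_hostname ec2-52-10-22-65.us-west-2.compute.amazonaws.com && firewall_is_disabled`, substituting that node's own public hostname.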
    

How to do it...

There are a lot of similarities between single-node and multi-node Hadoop installations, so instead of repeating the steps, I would suggest that you refer to earlier recipes as and when they're mentioned. So, let's start installing a multi-node Hadoop cluster:

  1. Install Java and Hadoop, as discussed in the previous recipe, on the master and slave nodes. Refer to steps 1-5 in the previous recipe.
  2. EC2 Ubuntu instances come with ssh preinstalled, so there's no need to install it again. To configure it, we need to perform the following steps.

    First, copy the PEM key with which you launched the EC2 instances to the master node. Next, execute the following set of commands, which add the key as an identity to the ssh agent on the master so that it can be used for passwordless logins to the slave machines:

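    # Start an ssh agent for the current shell session
    eval `ssh-agent -s`
    chmod 644 $HOME/.ssh/authorized_keys
    # The private key must not be readable by other users, or ssh-add rejects it
    chmod 400 <my-pem-key>.pem
    # Register the key with the running agent
    ssh-add <my-pem-key>.pem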
    

    If you are not using AWS EC2, you need to generate an ssh key on the master and copy it to the slave machines. Here are sample commands to do this:

    ssh-keygen -t rsa -P ""
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave1
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave2
    
  3. Next, we need to perform the Hadoop configurations. Most of the configuration files will be the same as they were for the single-node installation, and they are identical on all the nodes in the cluster. Refer to step 8 from the previous recipe for hadoop-env.sh, mapred-site.xml, and hdfs-site.xml. For core-site.xml and yarn-site.xml, we need to add some more properties, as shown here:

    Edit core-site.xml and add the host and port on which NameNode will run. As this is a multi-node Hadoop cluster installation, we use the master's hostname instead of localhost. (Hadoop 2.X still accepts fs.default.name, although it is deprecated in favor of fs.defaultFS.)

    <configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://<master's-host-name>:9000/</value>
    </property>
    </configuration>

    Edit yarn-site.xml and add the following properties. As this is a multi-node installation, we also need to provide the address of the machine where ResourceManager is running:

    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <!-- The key embeds the aux-service name declared above (mapreduce_shuffle) -->
            <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value><master's-host-name></value>
        </property>
    </configuration>

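    In the previous recipe, we set the replication factor in hdfs-site.xml to 1. As this is a multi-node cluster, we set it to 3. Note that HDFS places at most one replica of a block on each DataNode, so with fewer than three DataNodes, blocks will be reported as under-replicated. Don't forget to create the storage folders configured in hdfs-site.xml.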

    These configurations need to be made on all the machines of the cluster.

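  4. Now that we are done with the configurations, execute the NameNode format command on the master node so that it creates the required subfolder structure (in Hadoop 2.X, the older hadoop namenode -format form still works but prints a deprecation warning):
    hdfs namenode -format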
    
  5. Now, we need to start specific services on specific nodes in order to start the cluster.

    On the master node, execute the following:

    /usr/local/hadoop/sbin/hadoop-daemon.sh start namenode
    /usr/local/hadoop/sbin/hadoop-daemon.sh start secondarynamenode
    /usr/local/hadoop/sbin/yarn-daemon.sh start resourcemanager
    

    On all slave nodes, execute the following:

    /usr/local/hadoop/sbin/hadoop-daemon.sh start datanode
    /usr/local/hadoop/sbin/yarn-daemon.sh start nodemanager
    

    If everything goes well, you should be able to see the cluster running properly. You can also check the web interfaces for NameNode and ResourceManager. For NameNode, go to http://<master-ip-hostname>:50070/.


For ResourceManager, go to http://<master-ip-hostname>:8088/.
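
To check both web interfaces from the command line rather than from a browser, a small probe like the following could be used. This is a sketch under assumptions: `ui_urls` and `probe_uis` are hypothetical names, `curl` is assumed to be installed, and the ports are the ones given in this recipe (50070 for NameNode and 8088 for ResourceManager).

```shell
#!/bin/sh
# Print the two web UI URLs for a given master host, one per line.
ui_urls() {
    master=$1
    printf 'http://%s:50070/\n' "$master"   # NameNode
    printf 'http://%s:8088/\n' "$master"    # ResourceManager
}

# Request each URL and report whether it answered without an HTTP error.
# Returns non-zero if any of them could not be reached.
probe_uis() {
    status=0
    for url in $(ui_urls "$1"); do
        if curl -s -f -o /dev/null --max-time 5 "$url"; then
            echo "reachable:   $url"
        else
            echo "unreachable: $url"
            status=1
        fi
    done
    return "$status"
}
```

Calling `probe_uis <master-ip-hostname>` from any machine allowed by the security group prints one line per interface, which is a quick way to confirm that NameNode and ResourceManager came up.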


How it works...

Refer to the How it works section from the previous recipe.
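
The main multi-node-specific change in this recipe is that core-site.xml points at the master's hostname on every node. As an illustration, that file can be rendered from the hostname instead of being edited by hand on each machine. The sketch below is not from the original recipe: `render_core_site` is a hypothetical helper, and the property name and port are the ones shown in step 3.

```shell
#!/bin/sh
# Print a core-site.xml that points HDFS clients at the given master host.
render_core_site() {
    master_host=$1
    cat <<EOF
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://${master_host}:9000/</value>
    </property>
</configuration>
EOF
}
```

For example, `render_core_site "$(uname -n)" > core-site.xml` on the master produces a file that can then be copied to the Hadoop configuration directory on every node; the exact directory depends on where Hadoop was installed in the previous recipe.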
