Installing a multi-node Hadoop cluster
Now that we are comfortable with a single-node Hadoop installation, it's time to learn about a multi-node Hadoop installation.
Getting ready
In the previous recipe, we used a single Ubuntu machine for installation; in this recipe, we will be using three Ubuntu machines. If you are trying to install Hadoop for your own purposes and you don't have three machines to try this recipe, I would suggest that you get three AWS EC2 Ubuntu machines. I am using the t2.small type of EC2 instance. For more information on this, go to https://aws.amazon.com/ec2/.
Apart from this, I've also performed the following configurations on all the EC2 instances that I have been using:
- Create an AWS security group that allows traffic to the EC2 instances, and add all the instances to this security group (a scripted sketch follows this list).
- Change the hostname of each EC2 instance to its public DNS name, like this:
sudo hostname ec2-52-10-22-65.us-west-2.compute.amazonaws.com
- Disable firewalls for EC2 Ubuntu instances:
sudo ufw disable
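If you prefer to script the security group setup, here is a minimal AWS CLI sketch. The group name hadoop-cluster and the wide-open SSH rule are illustrative assumptions for a test cluster; tighten the CIDR ranges for anything long-lived:
# Create a security group for the cluster (the name is an example)
aws ec2 create-security-group --group-name hadoop-cluster --description "Hadoop multi-node cluster"
# Allow inbound SSH (restrict the CIDR in real deployments)
aws ec2 authorize-security-group-ingress --group-name hadoop-cluster --protocol tcp --port 22 --cidr 0.0.0.0/0
# Let cluster members talk to each other on all TCP ports
aws ec2 authorize-security-group-ingress --group-name hadoop-cluster --protocol tcp --port 0-65535 --source-group hadoop-cluster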
How to do it...
There are a lot of similarities between single-node and multi-node Hadoop installations, so instead of repeating the steps, I would suggest that you refer to earlier recipes as and when they're mentioned. So, let's start installing a multi-node Hadoop cluster:
- Install Java and Hadoop on the master and all slave nodes, as discussed in steps 1-5 of the previous recipe.
- AWS EC2 comes with ssh installed, so there's no need to install it again. To configure it, we need to perform the following steps. First, copy the PEM key with which you launched the EC2 instances to the master node. Next, execute the following commands, which add the key as an identity to the master's ssh configuration so that it can be used for passwordless logins to the slave machines:
eval `ssh-agent -s`
chmod 644 $HOME/.ssh/authorized_keys
chmod 400 <my-pem-key>.pem
ssh-add <my-pem-key>.pem
But if you are NOT using AWS EC2, you need to generate an ssh key on the master and copy it to the slave machines. Here are sample commands to do this:
ssh-keygen -t rsa -P ""
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave2
Either way, verify the setup before moving on, as shown next.
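A quick way to verify passwordless login (assuming slave1 and slave2 resolve from the master) is to run a remote command on each slave:
ssh ubuntu@slave1 hostname
ssh ubuntu@slave2 hostname
Each command should print the slave's hostname without prompting for a password or passphrase.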
- Next, we need to perform the Hadoop configurations. Most of the configuration files will be the same as they were in the case of the single-node installation, and these configurations are the same for all the nodes in the cluster. Refer to step 8 from the previous recipe for hadoop-env.sh, mapred-site.xml, and hdfs-site.xml. For core-site.xml and yarn-site.xml, we need to add some more properties, as shown here:
Edit core-site.xml and add the host and port on which you wish to install NameNode. As this is a multi-node Hadoop cluster installation, we will add the master's hostname instead of the localhost:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<master's-host-name>:9000/</value>
  </property>
</configuration>
Edit yarn-site.xml and add the following properties. As this is a multi-node installation, we also need to provide the address of the machine where ResourceManager is running:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value><master's-host-name></value>
  </property>
</configuration>
In the case of hdfs-site.xml, in the previous recipe, we set the replication factor to 1. As this is a multi-node cluster, we set it to 3. Don't forget to create the storage folders configured in hdfs-site.xml (a sample configuration follows this step). These configurations need to be made on all the machines of the cluster.
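For reference, a minimal hdfs-site.xml for this cluster might look like the following sketch. The dfs.namenode.name.dir and dfs.datanode.data.dir paths are assumptions; substitute whatever storage folders you configured in the previous recipe:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/data/datanode</value>
  </property>
</configuration>
You can then create the folders and push the finished configuration files from the master to the slaves, for example:
mkdir -p /usr/local/hadoop/data/namenode /usr/local/hadoop/data/datanode
scp /usr/local/hadoop/etc/hadoop/*.xml ubuntu@slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/*.xml ubuntu@slave2:/usr/local/hadoop/etc/hadoop/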
- Now that we are done with the configurations, execute the namenode format command on the master so that it creates the required subfolder structure:
hadoop namenode -format
- Now, we need to start specific services on specific nodes in order to start the cluster.
On the master node, execute the following:
/usr/local/hadoop/sbin/hadoop-daemon.sh start namenode
/usr/local/hadoop/sbin/hadoop-daemon.sh start secondarynamenode
/usr/local/hadoop/sbin/yarn-daemon.sh start resourcemanager
On all slave nodes, execute the following:
/usr/local/hadoop/sbin/hadoop-daemon.sh start datanode
/usr/local/hadoop/sbin/yarn-daemon.sh start nodemanager
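Before turning to the web interfaces, you can confirm that the daemons are up with jps, the JVM process listing tool that ships with the JDK:
jps
On the master, the output should include NameNode, SecondaryNameNode, and ResourceManager; on each slave, DataNode and NodeManager.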
If everything goes well, you should be able to see the cluster running properly. You can also check out the web interfaces for NameNode and ResourceManager. For NameNode, go to http://<master-ip-hostname>:50070/.
For ResourceManager, go to http://<master-ip-hostname>:8088/.
How it works...
Refer to the How it works section from the previous recipe.