Setting up a Hadoop cluster
In this case, assuming that you already have a single-node setup as explained in the previous sections, with ssh enabled, you just need to change all the slave configurations to point to the master. This can be achieved by first creating the slaves file in the $HADOOP_PREFIX/etc/hadoop folder on the master, listing the hostnames of all slave nodes. Similarly, on all slaves, you need the masters file in the $HADOOP_PREFIX/etc/hadoop folder to point to your master server's hostname.
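For example, a minimal slaves file simply lists one slave hostname per line. The hostnames here are placeholders; substitute the names of your own nodes:
slave-node-1
slave-node-2
slave-node-3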
Note
While adding new entries for the hostname, one must ensure that the firewall is disabled to allow remote nodes access to different ports. Alternatively, specific ports can be opened by modifying the Hadoop configuration files. Similarly, all the names of the nodes participating in the cluster should be resolvable through DNS (Domain Name System), or through the /etc/hosts entries of Linux.
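As an illustration, the /etc/hosts file on each node might contain entries such as the following; the IP addresses and hostnames are placeholders for your own network:
192.168.1.10 master-server
192.168.1.11 slave-node-1
192.168.1.12 slave-node-2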
Once this is ready, let us change the configuration files. Open core-site.xml, and add the following entry in it:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-server:9000</value>
  </property>
</configuration>
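This fs.defaultFS entry must be present on every node so that the daemons know where to reach the namenode. As a sketch, assuming passwordless ssh is already set up, that $HADOOP_PREFIX resolves to the same path on all nodes, and using the placeholder hostnames from earlier, the file can be copied to each slave as follows:
$ scp $HADOOP_PREFIX/etc/hadoop/core-site.xml slave-node-1:$HADOOP_PREFIX/etc/hadoop/
$ scp $HADOOP_PREFIX/etc/hadoop/core-site.xml slave-node-2:$HADOOP_PREFIX/etc/hadoop/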
All other configuration is optional. Now, start the servers in the following order. First, you need to format your storage for the cluster; use the following command to do so:
$ $HADOOP_PREFIX/bin/hdfs namenode -format <Name of Cluster>
This formats the name node for a new cluster. Once the name node is formatted, the next step is to ensure that HDFS is up and connected to each node. Start the namenode first:
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode
Similarly, the datanode can be started on all the slaves:
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh start datanode
Keep track of the log files in the $HADOOP_PREFIX/logs folder in order to see that there are no exceptions. Once HDFS is available, the namenode can be accessed through its web interface, which listens on port 50070 of the master server by default.
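To verify quickly that the daemons have come up, you can also run the jps tool that ships with the JDK on each node:
$ jps
On the master, the resulting process list should include NameNode; on each slave, it should include DataNode.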
The next step is to start YARN and its associated applications. First, start the ResourceManager (RM):
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh start resourcemanager
Each node must run an instance of the node manager. To run the node manager, use the following command:
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh start nodemanager
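As a quick sanity check, the standard YARN command-line client can list the nodes that have successfully registered with the RM:
$ $HADOOP_YARN_HOME/bin/yarn node -list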
Optionally, you can also run the Job History Server on the Hadoop cluster by using the following command:
$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver
Once all instances are up, you can see the status of the cluster on the web through the RM UI, which listens on port 8088 of the master server by default. The complete setup can be tested by running the simple wordcount example.
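As a sketch of such a test, the wordcount example bundled with the Hadoop distribution can be run as shown here; the input file name is a placeholder, and the version in the examples jar path depends on your release:
$ $HADOOP_PREFIX/bin/hdfs dfs -mkdir -p /input
$ $HADOOP_PREFIX/bin/hdfs dfs -put localfile.txt /input/
$ $HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
$ $HADOOP_PREFIX/bin/hdfs dfs -cat /output/part-r-00000
The first two commands stage a local text file into HDFS, the third runs the MapReduce job across the cluster, and the last one prints the word counts from the reducer output.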
This way, your cluster is set up and ready to run with multiple nodes. For advanced setup instructions, visit the Apache Hadoop website at http://hadoop.apache.org.