Understanding clusters and nodes
The primary construct of Amazon EMR is the cluster: a collection of Amazon EC2 instances, which are called nodes. Each node has a type that reflects the role it plays in the cluster, and based on that type, the respective Hadoop libraries are installed and configured on the instance.
The following are the node types available in EMR:
- Master node: The master node manages the cluster's instances, monitors their health, coordinates job execution, and tracks the status of tasks. Every cluster requires this node type, and a single-node cluster can consist of just a master node.
- Core node: This node type stores data in the Hadoop Distributed File System (HDFS) on your cluster and runs Hadoop application services such as Hive, Pig, HBase, and Hue. A multi-node cluster should have at least one core node.
- Task node: This...
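To make the node types concrete, here is a minimal Python sketch of how a cluster layout could be described and checked against the rules stated above: a master node is always required, and a multi-node cluster should have at least one core node. The `NodeGroup` class, the `build_instance_groups` helper, and the `m5.xlarge` instance type are illustrative assumptions, not part of EMR itself. The output dictionaries use the `InstanceRole`, `InstanceType`, and `InstanceCount` keys that the boto3 EMR client accepts in the `Instances["InstanceGroups"]` argument of `run_job_flow`, but no AWS call is made here.

```python
from dataclasses import dataclass

# Roles correspond to the master, core, and task node types.
VALID_ROLES = {"MASTER", "CORE", "TASK"}


@dataclass
class NodeGroup:
    """Hypothetical helper type: one group of identically configured nodes."""
    role: str           # "MASTER", "CORE", or "TASK"
    instance_type: str  # an EC2 instance type, e.g. "m5.xlarge" (placeholder)
    count: int


def build_instance_groups(groups):
    """Validate a cluster layout and return instance-group dictionaries."""
    roles = {g.role for g in groups}
    unknown = roles - VALID_ROLES
    if unknown:
        raise ValueError(f"unknown node roles: {sorted(unknown)}")
    # A master node is required in every cluster.
    if "MASTER" not in roles:
        raise ValueError("a cluster requires a master node")
    # A multi-node cluster should have at least one core node.
    total_nodes = sum(g.count for g in groups)
    if total_nodes > 1 and "CORE" not in roles:
        raise ValueError("a multi-node cluster needs at least one core node")
    return [
        {
            "Name": f"{g.role.lower()}-group",
            "InstanceRole": g.role,
            "InstanceType": g.instance_type,
            "InstanceCount": g.count,
        }
        for g in groups
    ]


# A single-node cluster: just a master node.
single = build_instance_groups([NodeGroup("MASTER", "m5.xlarge", 1)])

# A multi-node cluster: one master, two core nodes, and three task nodes.
multi = build_instance_groups([
    NodeGroup("MASTER", "m5.xlarge", 1),
    NodeGroup("CORE", "m5.xlarge", 2),
    NodeGroup("TASK", "m5.xlarge", 3),
])
```

In a real deployment, a list such as `multi` could be passed as `Instances={"InstanceGroups": multi}` when creating a cluster, alongside the other settings that call requires.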