Using S3 versus HDFS for cluster storage
As you may have understood by now, EMR gives you the flexibility to choose either HDFS or EMRFS with S3 as the cluster's persistent storage. As explained previously, an EMR cluster has three types of nodes: the master node, core nodes, and task nodes.
Now, let's look at how these two storage layers differ and which problems each one solves.
HDFS as cluster-persistent storage
As you can see from the following diagram, there are multiple core nodes pointing to the master node, and each core node has its own CPU, memory, and HDFS storage:
These are some properties to be aware of when your cluster uses HDFS as persistent storage:
- To be fault-tolerant, HDFS needs to maintain multiple copies of the data across the core nodes; by default, three copies (the standard HDFS replication factor).
- An EMR cluster is deployed in a single Availability Zone (AZ) of a Region, so a complete AZ failure...
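To make the cost of replication concrete, the following is a minimal sketch of the arithmetic, not an EMR API. The function names, the node count, and the disk size are hypothetical values chosen for illustration, and the estimate ignores space reserved for the operating system, logs, and other non-HDFS use.

```python
def usable_hdfs_capacity_gib(core_nodes: int,
                             disk_per_node_gib: float,
                             replication: int = 3) -> float:
    """Approximate usable HDFS capacity across the core nodes.

    Every block is stored `replication` times, so the raw disk space
    is divided by the replication factor. All inputs are illustrative
    assumptions, not values read from a real cluster.
    """
    if core_nodes < 1 or replication < 1:
        raise ValueError("core_nodes and replication must be at least 1")
    raw_capacity_gib = core_nodes * disk_per_node_gib
    return raw_capacity_gib / replication


def tolerated_replica_losses(replication: int) -> int:
    """Number of copies of a block that can be lost while one copy survives."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return replication - 1


if __name__ == "__main__":
    # Hypothetical cluster: 6 core nodes with 500 GiB of HDFS disk each.
    print(usable_hdfs_capacity_gib(6, 500))   # 1000.0
    print(tolerated_replica_losses(3))        # 2
```

With three copies, roughly two-thirds of the raw disk space on the core nodes goes to redundancy, which is the trade-off to weigh when sizing a cluster that relies on HDFS for persistence.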