Understanding EMR file system
To run a large data workload, you need scalable storage and a file system to support that storage. One major differentiator for EMR is its support for S3, for which AWS built a proprietary file system called EMRFS (EMR File System). EMR also continues to support other traditional file systems. Let's look at the file systems supported by EMR:
- HDFS (Hadoop Distributed File System) - HDFS is a distributed, scalable, and portable file system for Hadoop. An advantage of HDFS is data awareness between the Hadoop cluster nodes managing the clusters and the Hadoop cluster nodes managing the individual steps. For more information, see https://hadoop.apache.org/docs/stable/. The leader and core nodes use HDFS. HDFS is fast, but its storage is ephemeral: it is reclaimed when the cluster terminates. It's best used for caching the results produced by intermediate job-flow steps.
- EMRFS (EMR File System) - EMRFS is an implementation of...