HDFS achieves fault tolerance by replicating each block three times by default, although in large clusters the replication factor may be set higher. Replication protects against data loss from machine failures, provides data locality for MapReduce jobs, and so on. It comes at a storage cost: with a replication factor of three, HDFS needs an extra 200% of space to store a file. In short, storing 1 GB of data requires 3 GB of disk space. Replication also increases the metadata footprint in the NameNode's memory, since the NameNode tracks the location of every replica.
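As a quick check of that arithmetic, here is a minimal sketch; the class and variable names are illustrative, not part of any HDFS API:

```java
/** Illustrative arithmetic for the storage cost of 3x HDFS replication. */
public class ReplicationOverhead {
    public static void main(String[] args) {
        int replicationFactor = 3; // HDFS default
        long fileSizeGb = 1;       // logical size of the data

        long storedGb = fileSizeGb * replicationFactor;   // 3 GB on disk
        int extraPercent = (replicationFactor - 1) * 100; // 200% extra space

        System.out.printf("1 GB file occupies %d GB (an extra %d%%)%n",
                storedGb, extraPercent);
    }
}
```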
HDFS introduced erasure coding (EC) in Hadoop 3 to store data with less storage overhead. Data is now labelled based on its access pattern, and once the conditions for erasure coding are satisfied, the data becomes eligible for it. The term Data Temperature is used to identify the data usage pattern. The different...
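To illustrate the saving, consider the Reed-Solomon RS(6,3) scheme behind Hadoop 3's default EC policy (RS-6-3-1024k): each stripe of 6 data cells gets 3 parity cells, for a 50% storage overhead instead of replication's 200%, while still tolerating the loss of any 3 cells per stripe. A minimal sketch of that arithmetic, with illustrative names:

```java
/** Illustrative arithmetic for Reed-Solomon RS(6,3) erasure coding overhead. */
public class ErasureCodingOverhead {
    public static void main(String[] args) {
        int dataUnits = 6;    // data cells per RS(6,3) stripe
        int parityUnits = 3;  // parity cells per stripe

        // Overhead = parity / data = 3/6 = 50%
        double ecOverheadPercent = 100.0 * parityUnits / dataUnits;

        // Storing 1 GB costs ~1.5 GB under RS(6,3), vs. 3 GB under 3x
        // replication, and any 3 of the 9 cells in a stripe can be lost
        // and reconstructed.
        double storedGb = 1.0 * (dataUnits + parityUnits) / dataUnits;

        System.out.printf("RS(6,3) overhead: %.0f%%; 1 GB stored as %.1f GB%n",
                ecOverheadPercent, storedGb);
    }
}
```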