Understanding HDFS backups
Data volumes in Hadoop clusters range from terabytes to petabytes, so deciding what data to back up from such clusters is a critical decision. A disaster recovery plan for a Hadoop cluster needs to be formulated right at the cluster planning stage. The organization needs to identify the datasets it wants to back up and plan backup storage requirements accordingly.
Backup schedules also need to be considered when designing a backup solution. The larger the dataset to be backed up, the more time-consuming the activity, so it is more efficient to perform backups during a window when there is the least activity on the cluster. This not only helps the backup commands run efficiently, but also ensures the consistency of the datasets being backed up. Knowing the likely schedules of data ingestion into HDFS in advance helps you better plan and schedule backup solutions for Hadoop clusters.
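As a minimal sketch of such a scheduled backup, the script below builds a `hadoop distcp` command to copy a dataset to a backup cluster; it could be run from cron during the low-activity window. The cluster addresses and paths are assumptions for illustration, not values from the text, and the script only prints the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical example: nightly DistCp backup during a low-activity window.
# The source/destination clusters and paths below are illustrative assumptions.
SRC=hdfs://prodcluster:8020/data/critical
DST=hdfs://backupcluster:8020/backups/critical/$(date +%Y-%m-%d)

# -update copies only files that differ from the destination;
# -p (with no argument) preserves attributes such as replication,
# block size, user, group, and permissions.
CMD="hadoop distcp -update -p $SRC $DST"

if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run: show what would be executed without touching the clusters.
    echo "$CMD"
else
    $CMD
fi
```

A crontab entry such as `0 2 * * * DRY_RUN=0 /opt/scripts/hdfs_backup.sh` would then run the copy at 02:00, a typical low-activity window, though the right window depends on your own ingestion schedule.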
The following are some of the important...