Using the distributed copy (DistCp)
Distributed copy (DistCp) is a Hadoop utility for copying data in parallel within and between clusters. It runs the copy as a MapReduce job, so the work is divided among map tasks across the cluster. DistCp is one of the most widely used data transfer tools in Hadoop clusters. For example:
$ hadoop distcp hdfs://namenode1/src hdfs://namenode2/dest
The preceding command copies the src folder and all its contents from the cluster managed by namenode1 to the cluster managed by namenode2 as the dest folder. By default, DistCp does not overwrite files at the target location; it skips any file that already exists there. Files can, however, be forced to be overwritten using the -overwrite flag.
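As a rough illustration of the default and overwrite behaviors, the sketch below prints the corresponding command lines for the example paths used above. The distcp_cmd helper and its mode names are hypothetical and exist only for this illustration. The -update mode, which copies only files that are missing or differ at the target, is not covered in the text above and is included here as an assumption about what readers commonly need next. The script only prints commands, so it can be tried on a machine without a Hadoop client.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop): prints the DistCp command
# line for a given copy mode. This is a dry run; nothing is copied.
distcp_cmd() {
  local mode="$1" src="$2" dest="$3"
  case "$mode" in
    default)   printf 'hadoop distcp %s %s\n' "$src" "$dest" ;;            # skip files that already exist
    overwrite) printf 'hadoop distcp -overwrite %s %s\n' "$src" "$dest" ;; # replace existing target files
    update)    printf 'hadoop distcp -update %s %s\n' "$src" "$dest" ;;    # copy only missing or changed files
    *)         echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

# Print the three variants for the example clusters from the text.
distcp_cmd default   hdfs://namenode1/src hdfs://namenode2/dest
distcp_cmd overwrite hdfs://namenode1/src hdfs://namenode2/dest
distcp_cmd update    hdfs://namenode1/src hdfs://namenode2/dest
```

To actually run a copy, execute one of the printed lines on a host with a configured Hadoop client. Note that, according to the Apache DistCp guide, -update and -overwrite also change how source paths map onto the target (the contents of src are copied into dest rather than the src folder itself), so it is worth checking the resulting layout on a small test directory first.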
Several other options can be used with the hadoop distcp command; details of these options can be found at http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_distcp_data_cluster_migrate.html.