Migrating data and metadata catalogs
As we learned earlier, using Amazon S3 as the persistent data store is the recommended approach when migrating your workloads to AWS or Amazon EMR. If your on-premises environment does not use Amazon S3 as the persistent data store, or your existing cluster has Hive Metastore tables, then you need to plan for migrating both data and metadata.
Let's look at the options available for migrating on-premises cluster data, metadata catalogs, or both.
Migrating data
To migrate your on-premises datasets to Amazon S3 or other storage solutions in AWS, you can consider the following tools and services AWS offers:
- Offline data movement using AWS Snowball and Snowmobile, which helps to migrate petabyte- and exabyte-scale datasets.
- For faster online data movement, use AWS Direct Connect, which provides a dedicated private network connection between your on-premises environment and AWS instead of routing transfers over the public internet.
- Use Hadoop's distcp command to do a distributed copy from on...
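To make the distcp option more concrete, here is a minimal Python sketch that assembles a hadoop distcp command line for copying an HDFS directory to Amazon S3. The helper name build_distcp_command, the NameNode address, the bucket name, and the choice of the s3a:// scheme are all illustrative assumptions and not taken from the text; -update and -m are standard distcp options, but the right values depend on your cluster and network.

```python
import shlex


def build_distcp_command(source, target, max_maps=20, update=True):
    """Assemble (but do not run) a `hadoop distcp` command as an argument list.

    Hypothetical helper for illustration only.
    source   -- source URI, e.g. an HDFS path on the on-premises cluster
    target   -- destination URI, e.g. an S3 path
    max_maps -- upper bound on simultaneous copy tasks (distcp's -m option)
    update   -- if True, pass -update so only missing or changed files are copied
    """
    cmd = ["hadoop", "distcp"]
    if update:
        cmd.append("-update")
    cmd += ["-m", str(max_maps), source, target]
    return cmd


if __name__ == "__main__":
    # Placeholder URIs: replace with your own NameNode address and bucket.
    cmd = build_distcp_command(
        "hdfs://namenode.example.internal:8020/data/warehouse",
        "s3a://example-migration-bucket/warehouse",
        max_maps=50,
    )
    # Print the command for review; on a node with the Hadoop client installed,
    # it could be executed with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list and printing it first lets you review the source and target paths before starting a large copy. On an on-premises Hadoop distribution, writing to S3 typically also requires the S3A connector and AWS credentials to be configured, which this sketch does not cover.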