Migrating ETL jobs and Oozie workflows
If you are doing a lift-and-shift migration and your ETL scripts are configured to read from and write to HDFS, then your existing Hive, MapReduce, and Spark scripts will work in EMR without substantial changes. But if, as part of migrating to AWS, you re-architected to use Amazon S3 as your persistence layer instead of HDFS, then you will have to change your scripts to interact with Amazon S3 (s3://) using EMRFS.
Important Note
Prior to the release of Amazon EMR 5.22.0, EMR supported the s3a:// and s3n:// prefixes for interacting with Amazon S3. These prefixes have not been deprecated and still work, but it is now recommended to use s3:// with EMRFS, which provides a higher level of security and easier integration with Amazon S3.
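In many scripts, the change amounts to swapping the hdfs:// URI (including the NameNode host and port) for an s3:// URI pointing at your bucket. The following is a minimal sketch of that rewrite; the bucket name, the helper function, and the flat mapping of the HDFS directory layout into a single bucket are all hypothetical and should be adapted to your own conventions:

```python
# Hypothetical helper sketching the HDFS -> EMRFS path change.
def to_emrfs_path(hdfs_path, bucket="my-data-lake"):
    """Rewrite an hdfs:// path to the equivalent s3:// path.

    Assumes the HDFS directory layout maps directly into a single
    S3 bucket; adjust the bucket/prefix logic for your environment.
    """
    prefix = "hdfs://"
    if hdfs_path.startswith(prefix):
        # Drop the scheme and the NameNode host:port, keep the path.
        rest = hdfs_path[len(prefix):]
        path = rest.split("/", 1)[1] if "/" in rest else ""
        return f"s3://{bucket}/{path}"
    # Leave non-HDFS paths (already on S3, local, etc.) untouched.
    return hdfs_path

print(to_emrfs_path("hdfs://namenode:8020/warehouse/sales/2023/"))
# → s3://my-data-lake/warehouse/sales/2023/
```

The same substitution applies wherever paths appear: Hive LOCATION clauses, Spark read/write calls, and job parameters passed in from your orchestrator.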
Apart from your Hive and Spark scripts, if you use Apache Oozie for workflow orchestration of your ETL jobs, then you need to plan for its migration too. Let's understand what options you have for...