Use case and architecture overview
For this use case, let's assume you have a vendor who delivers incremental sales data at the end of every day. The file arrives in an S3 input bucket as a CSV, and it needs to be processed and made available to your data analysts for querying.
Your assignment is to build a data pipeline that automatically picks up the new sales file from the S3 input bucket, applies the required transformations, and writes the result to a target S3 bucket that analysts will query. To implement this pipeline, you plan to use a transient EMR cluster with Spark as the distributed processing engine. The cluster does not run continuously: it is created just before the job executes and terminated once the job completes, so you only pay for compute while the job is running.
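To make the transient-cluster idea concrete, the following is a minimal sketch of the parameters you might pass to the EMR `run_job_flow` API via boto3. The bucket names, script path, cluster name, and instance sizes are all hypothetical placeholders; the key detail is `KeepJobFlowAliveWhenNoSteps: False`, which makes the cluster terminate automatically after its steps finish.

```python
def build_transient_emr_request(input_path, output_path):
    """Build run_job_flow parameters for a transient Spark cluster."""
    return {
        "Name": "daily-sales-transform",        # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",           # example EMR release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            # Transient behavior: shut down when no steps remain to run.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "process-sales-csv",
                # Also terminate if the step fails, so the cluster never lingers.
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "s3://my-scripts-bucket/process_sales.py",  # hypothetical script
                        input_path,
                        output_path,
                    ],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_transient_emr_request(
    "s3://sales-input-bucket/incremental/",   # hypothetical input bucket
    "s3://sales-output-bucket/processed/",    # hypothetical target bucket
)
# With credentials configured, this would be submitted as:
#   boto3.client("emr").run_job_flow(**request)
```

The `TERMINATE_CLUSTER` failure action pairs with `KeepJobFlowAliveWhenNoSteps: False` to guarantee the cluster disappears whether the job succeeds or fails, which is exactly the lifecycle described above.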
Architecture overview
The following is the high-level architecture diagram of the data pipeline:
Here are the steps as shown...