Setting up an EMR cluster for ETL
In DL, the computational power of a single EC2 instance may not be sufficient for model training or data processing, so a group of EC2 instances is often combined to increase throughput. AWS has a dedicated service for this purpose: Amazon Elastic MapReduce (EMR). It is a fully managed cluster platform for running big data frameworks such as Apache Spark and Apache Hadoop. In general, an EMR cluster set up for ETL reads data from AWS storage (Amazon S3), processes it, and writes the results back to AWS storage. Spark jobs are often used to implement the ETL logic that interacts with S3. EMR also provides a feature named Workspace, which helps developers organize notebooks and share them with other EMR users for collaborative work.
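To make the read, process, and write pattern more concrete, the following is a minimal sketch of how a Spark ETL job might be described as an EMR step. The helper name build_spark_etl_step, the bucket name, and the script path are hypothetical placeholders, not values from this chapter. The dictionary follows the step format that EMR accepts for running spark-submit through command-runner.jar.

```python
def build_spark_etl_step(script_uri, input_uri, output_uri, name="spark-etl"):
    """Build an EMR step definition that runs a PySpark script with spark-submit.

    The script at script_uri is assumed to read from input_uri, apply the ETL
    logic, and write the result to output_uri.
    """
    return {
        "Name": name,
        # Keep the cluster running if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step invoke commands such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_uri,
                input_uri,
                output_uri,
            ],
        },
    }


# Hypothetical S3 locations used only for illustration.
step = build_spark_etl_step(
    script_uri="s3://example-bucket/scripts/etl_job.py",
    input_uri="s3://example-bucket/raw/",
    output_uri="s3://example-bucket/processed/",
)
print(step["HadoopJarStep"]["Args"])
```

A step built this way could then be submitted to a running cluster, for example with the boto3 EMR client's add_job_flow_steps(JobFlowId=..., Steps=[step]) call, assuming the cluster has Spark installed and its role can access the S3 locations.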
A typical EMR setup contains a master node and a few core nodes. In the case of a multi-node cluster, there must be at least one core node. A master...