Best practices around EMR cluster configurations
When you start using EMR clusters for your Hadoop workloads, the primary focus is on writing logic for the ETL pipeline, so that your data gets processed and is available for end user consumption. There are several factors you might have considered to optimize your ETL code and the business logic integrated into it but, apart from optimizing code, there are several other optimizations that you can consider in terms of EMR cluster configurations that can help optimize your usage.
Let's understand some of the best practices that you can follow.
Choosing the correct cluster type (transient versus long-running)
As explained in Chapter 2, Exploring the Architecture and Deployment Options, there are two ways you can integrate EMR clusters. One is a long-running EMR cluster, which is useful for multi-tenant workloads or real-time streaming workloads. Then we have short-term job-specific transient clusters that get created when...