Reference architecture for interactive analytics and ML
The previous sections of this chapter showed Amazon EMR used as a transient cluster: one that is created by a file-arrival or scheduled event, processes the file with Hive or Spark steps, and is then terminated. Transient clusters are a good way to decouple storage from compute and to save costs by reducing cluster idle time.
However, there are a few use cases where you might need a persistent EMR cluster that stays active 24x7 with minimal node capacity and relies on the EMR autoscaling feature to scale up and down as needed. These persistent clusters generally serve multiple workloads, including ETL transformations with Hive or Spark, data analysis through SQL-based query engines such as Hive and Presto, and interactive ML model development through notebooks. In some cases, you can implement a multi-tenant EMR cluster that serves multiple teams, with access policies and data isolation between them.
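To make the difference between the two cluster types concrete, the following is a minimal sketch of the request parameters you might pass to the `run_job_flow` call of the boto3 EMR client. It is not a definitive configuration: the helper name `build_cluster_request`, the release label, the instance types, the default role names, and the capacity limits are all assumptions chosen for illustration. The sketch only builds the request dictionary, so it runs without AWS credentials.

```python
def build_cluster_request(name, persistent, min_units=2, max_units=10):
    """Build keyword arguments for the boto3 EMR client's run_job_flow call.

    Hypothetical helper: the release label, instance types, role names,
    and capacity limits below are illustrative assumptions.
    """
    if min_units > max_units:
        raise ValueError("min_units must not exceed max_units")

    request = {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release
        "Applications": [{"Name": app} for app in ("Hive", "Spark", "Presto")],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": min_units,
                },
            ],
            # True keeps the cluster alive after its steps finish (persistent);
            # False lets it terminate once all steps complete (transient).
            "KeepJobFlowAliveWhenNoSteps": persistent,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }

    if persistent:
        # Managed scaling lets EMR resize the cluster between the two limits,
        # so the cluster idles at minimal capacity and grows under load.
        request["ManagedScalingPolicy"] = {
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": min_units,
                "MaximumCapacityUnits": max_units,
            }
        }
    return request


persistent_request = build_cluster_request("analytics-shared", persistent=True)
transient_request = build_cluster_request("nightly-etl", persistent=False)
```

With credentials in place, a request built this way could be submitted with `boto3.client("emr").run_job_flow(**persistent_request)`. A transient cluster would normally also carry a `Steps` list describing the Hive or Spark work to run before it terminates, which this sketch leaves out.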
...