Decoupling compute and storage
When you launch an EMR cluster for your batch or streaming workloads, you have the option to use HDFS on the core nodes as your primary distributed storage or to use Amazon S3 as your distributed storage layer. Amazon S3 provides a highly durable and scalable storage solution, and Amazon EMR integrates with it natively.
With Amazon S3 as the cluster's distributed storage, you can decouple compute and storage, which gives you additional flexibility. It enables job-based transient clusters, where S3 acts as the permanent store and the core nodes' HDFS is used only for temporary or intermediate data. This way, each job can have its own cluster with the right amount of resources and scaling in place, and you avoid paying for an always-on cluster.
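The following is a minimal PySpark sketch of this pattern, assuming placeholder bucket names, paths, and column names: the job reads its source data from S3, stages intermediate results on the cluster's HDFS, and writes the final output back to S3 so that it survives cluster termination.

```python
# Minimal PySpark sketch: Amazon S3 as the persistent storage layer,
# HDFS only for intermediate data. Bucket names, paths, and columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-decoupled-etl").getOrCreate()

# Read the source dataset directly from S3 (the permanent store)
orders = spark.read.parquet("s3://my-data-lake/raw/orders/")

# Stage intermediate results on the cluster's HDFS (temporary store)
orders.filter("order_status = 'COMPLETED'") \
      .write.mode("overwrite").parquet("hdfs:///tmp/orders_completed/")

# Write the final output back to S3 so it outlives the cluster
completed = spark.read.parquet("hdfs:///tmp/orders_completed/")
completed.groupBy("order_date").count() \
         .write.mode("overwrite").parquet("s3://my-data-lake/curated/orders_daily_counts/")

spark.stop()
```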
The following diagram represents how multiple transient EMR clusters that contain various steps can use S3 as their common persistent storage layer. This can also help with disaster recovery implementation:
Now that you understand how EMR provides flexibility to decouple compute and storage, in the next section, you will learn how you can use this feature to create persistent or transient clusters depending on your use case.
Persistent versus transient clusters
A persistent cluster is one that is always active to support multi-tenant workloads or interactive analytics. These clusters can have a constant node capacity or a minimal set of nodes with autoscaling enabled. Autoscaling is an EMR feature that automatically scales the cluster up (adds nodes) or down (removes nodes) based on cluster utilization metrics. In future chapters, we will dive deep into EMR scaling features and options.
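As a preview of one of those options, here is a minimal boto3 sketch that attaches an EMR managed scaling policy to an existing cluster; the region, cluster ID, and capacity numbers are placeholder values.

```python
# Sketch: attach an EMR managed scaling policy to an existing cluster with boto3.
# The region, cluster ID, and capacity limits below are placeholder values.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",            # scale in units of EC2 instances
            "MinimumCapacityUnits": 3,          # keep a minimal set of nodes
            "MaximumCapacityUnits": 20,         # upper bound as the workload grows
            "MaximumCoreCapacityUnits": 6,      # cap core (HDFS) nodes; the rest are task nodes
            "MaximumOnDemandCapacityUnits": 10,
        }
    },
)
```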
Transient clusters are short-lived, job-based clusters. They are created when data arrives or through scheduled events, process the data, write the output back to the target storage, and then terminate. They also start with a fixed set of nodes and then scale out to support additional workload. With transient cluster workloads, Amazon S3 should ideally be used as the persistent data store so that, after the cluster terminates, you still have access to the data for additional ETL or business intelligence reporting.
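The following boto3 sketch launches such a transient cluster: it runs a single Spark step and terminates itself when the step completes, because KeepJobFlowAliveWhenNoSteps is set to False. The cluster name, instance types, IAM roles, and S3 paths are placeholders, not values tied to a specific example.

```python
# Sketch: launch a transient EMR cluster that runs one Spark step and then
# terminates itself. Names, paths, roles, and instance types are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-orders-etl",                 # placeholder cluster name
    ReleaseLabel="emr-6.9.0",                  # example EMR release
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-data-lake/emr-logs/",      # logs persist in S3 after termination
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once all steps complete
    },
    Steps=[
        {
            "Name": "orders-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-data-lake/scripts/orders_etl.py"],
            },
        }
    ],
)
print("Started transient cluster:", response["JobFlowId"])
```

Because both the logs and the job output land in S3, nothing of value is lost when the cluster terminates.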
Here is a diagram that represents different kinds of cluster use cases you may have:
As you can see, all three clusters use Amazon S3 as their persistent storage layer, which decouples compute and storage. This lets you scale compute and storage independently: Amazon S3 scales automatically and is designed for 99.999999999% (11 9s) durability, while the cluster's compute capacity can scale horizontally by adding more core or task nodes.
As represented in the diagram, transient clusters can run as scheduled jobs or as multiple workload-specific clusters running in parallel, each performing ETL on its own dataset with capacity sized for that workload.
When you implement transient clusters, a common best practice is to externalize your Hive Metastore, so that when a cluster is terminated and a new one becomes active, it does not need to recreate the metastore or catalog tables. To externalize the Hive Metastore of your EMR cluster, you have the option to use an Amazon RDS database as the Hive Metastore or to use the AWS Glue Data Catalog as your metastore.
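As a sketch of the AWS Glue Data Catalog option, the following shows the EMR configuration classifications that point Hive and Spark SQL at the Glue Data Catalog. It assumes you pass this list as the Configurations parameter when creating the cluster (for example, in a run_job_flow call like the one shown earlier); the variable name is just for illustration.

```python
# Sketch: configuration classifications that make Hive (and Spark SQL) on EMR
# use the AWS Glue Data Catalog instead of a cluster-local metastore.
GLUE_METASTORE_CONFIGURATIONS = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    },
]

# For example (reusing the boto3 client from the earlier sketch):
# emr.run_job_flow(..., Configurations=GLUE_METASTORE_CONFIGURATIONS)
```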