Simplify Big Data Analytics with Amazon EMR

You're reading from Simplify Big Data Analytics with Amazon EMR: A beginner's guide to learning and implementing Amazon EMR for building data analytics solutions

Product type Paperback
Published in Mar 2022
Publisher Packt
ISBN-13 9781801071079
Length 430 pages
Edition 1st Edition
Author (1):
Sakti Mishra

Table of Contents (19 chapters)

Preface
Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR
Chapter 1: An Overview of Amazon EMR
Chapter 2: Exploring the Architecture and Deployment Options
Chapter 3: Common Use Cases and Architecture Patterns
Chapter 4: Big Data Applications and Notebooks Available in Amazon EMR
Section 2: Configuration, Scaling, Data Security, and Governance
Chapter 5: Setting Up and Configuring EMR Clusters
Chapter 6: Monitoring, Scaling, and High Availability
Chapter 7: Understanding Security in Amazon EMR
Chapter 8: Understanding Data Governance in Amazon EMR
Section 3: Implementing Common Use Cases and Best Practices
Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark
Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming
Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi
Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA
Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR
Chapter 14: Best Practices and Cost-Optimization Techniques
Other Books You May Enjoy

Decoupling compute and storage

When you set up an EMR cluster for batch or streaming workloads, you can use either the HDFS on the cluster's core nodes or Amazon S3 as your primary distributed storage layer. Amazon S3 is highly durable and scalable, and Amazon EMR integrates with it natively.

Using Amazon S3 as the cluster's distributed storage decouples compute from storage, which gives you additional flexibility. It lets you run job-based transient clusters, where S3 acts as the permanent store and the HDFS on the cluster's core nodes is used only for temporary storage. Each job can then have its own cluster, sized and scaled for its needs, and you save costs by avoiding an always-on cluster.
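To make the idea of a job-based transient cluster concrete, here is a minimal sketch that builds the request you could pass to the `run_job_flow` call of the boto3 EMR client. It only constructs the request and does not contact AWS. The bucket names, script path, instance types, node counts, and release label are assumptions chosen for illustration, and the two role names are the EMR default roles, which may differ in your account.

```python
def build_transient_cluster_request(name, script_uri, input_uri, output_uri, log_uri):
    """Return keyword arguments for the boto3 EMR client's run_job_flow call.

    All S3 URIs are hypothetical placeholders supplied by the caller.
    """
    spark_step = {
        "Name": "spark-etl",
        # If the step fails, shut the cluster down instead of leaving it idle.
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # Input and output both live on S3, so nothing of value stays on HDFS.
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     script_uri, input_uri, output_uri],
        },
    }
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.5.0",  # assumed release label
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # False makes the cluster transient: it terminates once all steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [spark_step],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_cluster_request(
    name="nightly-orders-etl",
    script_uri="s3://example-code-bucket/jobs/orders_etl.py",
    input_uri="s3://example-data-bucket/raw/orders/",
    output_uri="s3://example-data-bucket/curated/orders/",
    log_uri="s3://example-log-bucket/emr/",
)

# To launch the cluster (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

The setting that matters most here is `KeepJobFlowAliveWhenNoSteps`: with it set to `False`, the cluster exists only for the duration of its steps, while the data it reads and writes remains in S3.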

The following diagram shows how multiple transient EMR clusters, each running its own steps, can use S3 as their common persistent storage layer. Because the data outlives any single cluster, this pattern can also help with disaster recovery:

Figure 1.3 – Multiple EMR clusters using Amazon S3 as their distributed storage


Now that you understand how EMR provides flexibility to decouple compute and storage, in the next section, you will learn how you can use this feature to create persistent or transient clusters depending on your use case.

Persistent versus transient clusters

A persistent cluster is always active so that it can support multi-tenant workloads or interactive analytics. Such a cluster can have a constant node capacity or a minimal set of nodes with autoscaling enabled. Autoscaling is an EMR feature that automatically scales cluster resources up (adds nodes) or down (removes nodes) based on a few cluster utilization parameters. Later chapters dive deep into EMR scaling features and options.
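As an illustration of "a minimal set of nodes with autoscaling", the following sketch builds an EMR managed scaling policy, which is one of the scaling options EMR offers. The helper name and the capacity numbers are assumptions for this example; the resulting dictionary follows the shape accepted by the `put_managed_scaling_policy` call of the boto3 EMR client.

```python
def build_managed_scaling_policy(min_units, max_units, max_core_units=None):
    """Return a ManagedScalingPolicy dict that bounds cluster size in instances."""
    if max_units < min_units:
        raise ValueError("max_units must be greater than or equal to min_units")
    limits = {
        "UnitType": "Instances",
        # The cluster never shrinks below this many instances...
        "MinimumCapacityUnits": min_units,
        # ...and never grows beyond this many.
        "MaximumCapacityUnits": max_units,
    }
    if max_core_units is not None:
        if max_core_units > max_units:
            raise ValueError("max_core_units cannot exceed max_units")
        # Capacity above this limit is added as task nodes only.
        limits["MaximumCoreCapacityUnits"] = max_core_units
    return {"ComputeLimits": limits}


# Hypothetical persistent cluster: 2 nodes when idle, up to 10 under load.
policy = build_managed_scaling_policy(min_units=2, max_units=10, max_core_units=4)

# To attach it to a running cluster (requires boto3 and AWS credentials;
# the cluster ID is a placeholder):
#   import boto3
#   boto3.client("emr").put_managed_scaling_policy(
#       ClusterId="j-XXXXXXXXXXXXX", ManagedScalingPolicy=policy)
```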

Transient clusters are short-lived, job-based clusters. They are created when data arrives or on a schedule, process the data, write the output back to target storage, and then terminate. They also start with a constant set of nodes and can then scale to support additional workload. With transient clusters, Amazon S3 is ideally used as the persistent data store so that, after the cluster terminates, you still have access to the data for additional ETL or business intelligence reporting.
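One possible way to react to data arrival is an S3 event notification handled by a small function (for example, in AWS Lambda) that works out which objects arrived and then launches a transient cluster for them. The trigger mechanism is an assumption here, not something this section prescribes. The sketch below covers only the first part: turning an S3 event payload into `s3://` URIs. The bucket and key in the sample event are hypothetical.

```python
from urllib.parse import unquote_plus


def input_uris_from_s3_event(event):
    """Extract s3:// URIs for every object referenced in an S3 event notification."""
    uris = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events, so decode them first.
        key = unquote_plus(record["s3"]["object"]["key"])
        uris.append(f"s3://{bucket}/{key}")
    return uris


# Trimmed-down sample event with a hypothetical bucket and key.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-landing-bucket"},
                "object": {"key": "incoming/2022-03-01/orders+part+1.csv"},
            }
        }
    ]
}

arrived = input_uris_from_s3_event(sample_event)
print(arrived)
```

Each URI returned could then be passed as the input location of a step on a newly created transient cluster.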

The following diagram represents the different kinds of cluster use cases you may have:

Figure 1.4 – EMR architecture representing cluster nodes


As you can see, all three clusters use Amazon S3 as their persistent storage layer, which decouples compute and storage. This lets compute and storage scale independently: Amazon S3 provides scalable storage with 99.999999999% (11 9s) durability, and the cluster's compute capacity can scale horizontally by adding more core or task nodes.
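To get a feel for what 11 9s of durability means, you can treat the figure as an average per-object durability over a year and work out the expected loss for a given number of stored objects. The object count below is a hypothetical example, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# 99.999999999% durability, read as a per-object, per-year average.
DURABILITY = Fraction("0.99999999999")
ANNUAL_LOSS_PROBABILITY = 1 - DURABILITY  # 1 in 100,000,000,000


def expected_objects_lost_per_year(object_count):
    """Expected number of objects lost in one year, on average."""
    return object_count * ANNUAL_LOSS_PROBABILITY


stored = 10_000_000  # hypothetical number of objects in the data lake
lost_per_year = expected_objects_lost_per_year(stored)
years_per_lost_object = 1 / lost_per_year

print(float(lost_per_year))
print(years_per_lost_object)  # prints 10000
```

Under this reading, storing ten million objects corresponds to losing, on average, a single object once every 10,000 years.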

As represented in the diagram, transient clusters can run scheduled jobs or serve as multiple workload-specific clusters running in parallel to do ETL on their own datasets, each with capacity sized for its workload.

When you implement transient clusters, a common best practice is to externalize your Hive Metastore so that, when a cluster is terminated and a new one is launched, the Metastore or catalog tables do not need to be created again. To externalize the Hive Metastore of your EMR cluster, you can use either an Amazon RDS database or the AWS Glue Data Catalog as the Metastore.
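Both options are typically expressed as EMR configuration classifications supplied when the cluster is created. The following sketch builds the two variants as Python data that could be passed as the `Configurations` argument of the boto3 `run_job_flow` call. The helper names, JDBC URL, and credentials are placeholders for illustration; in practice you would not hardcode a database password like this.

```python
# Client factory class that points Hive and Spark at the AWS Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations():
    """Configurations that make Hive and Spark SQL use the Glue Data Catalog."""
    props = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    return [
        {"Classification": "hive-site", "Properties": dict(props)},
        {"Classification": "spark-hive-site", "Properties": dict(props)},
    ]


def rds_metastore_configurations(jdbc_url, user, password):
    """Configurations that point Hive at an external database such as Amazon RDS."""
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL": jdbc_url,
                "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
                "javax.jdo.option.ConnectionUserName": user,
                # Placeholder only: keep real credentials out of source code.
                "javax.jdo.option.ConnectionPassword": password,
            },
        }
    ]


glue_config = glue_catalog_configurations()
rds_config = rds_metastore_configurations(
    jdbc_url="jdbc:mysql://example-host:3306/hive?createDatabaseIfNotExist=true",
    user="hive_user",
    password="example-password",
)
```

With either configuration in place, every new transient cluster connects to the same external Metastore, so table definitions created by one cluster are visible to the next.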
