Simplify Big Data Analytics with Amazon EMR: A beginner's guide to learning and implementing Amazon EMR for building data analytics solutions

Product type: Paperback
Published: Mar 2022
Publisher: Packt
ISBN-13: 9781801071079
Length: 430 pages
Edition: 1st Edition
Tools: Amazon EMR
Concepts: Big Data
Author: Sakti Mishra


Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

1. You receive a daily incremental file from your source system at midnight, and you are expected to process it and make it available for consumption. Later the same day, at 3 P.M., you need to run a machine learning job that reads this processed output. Would you use a persistent cluster or a transient cluster, and how would you configure it?
2. While creating an EMR cluster, you are required to select multiple instance types for each node type, and you would also like to take advantage of Spot Instances. How would you configure your cluster?
3. You have a manufacturing unit that expects all Hadoop/Spark processing to happen near its on-premises site, but it plans to migrate to the cloud gradually. Which Amazon EMR deployment option is best suited?
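
As a starting point for thinking about the second question, one option worth considering is EMR instance fleets, which let a node type draw from several instance types and mix On-Demand and Spot capacity. The sketch below only builds a dictionary in the shape that boto3's `run_job_flow` accepts under `Instances["InstanceFleets"]`; it does not call AWS. The helper name `build_instance_fleet`, the instance types, the capacities, and the Spot timeout settings are illustrative assumptions, not values taken from the book.

```python
from typing import Any, Dict, List


def build_instance_fleet(fleet_type: str, instance_types: List[str],
                         on_demand_capacity: int = 0,
                         spot_capacity: int = 0) -> Dict[str, Any]:
    """Hypothetical helper: build one instance-fleet entry for an EMR request."""
    fleet: Dict[str, Any] = {
        "Name": f"{fleet_type.lower()}-fleet",
        "InstanceFleetType": fleet_type,  # "MASTER", "CORE", or "TASK"
        # Several instance types per fleet; each counts as one unit here.
        "InstanceTypeConfigs": [
            {"InstanceType": t, "WeightedCapacity": 1} for t in instance_types
        ],
    }
    # Only include the capacity targets that are actually requested.
    if on_demand_capacity > 0:
        fleet["TargetOnDemandCapacity"] = on_demand_capacity
    if spot_capacity > 0:
        fleet["TargetSpotCapacity"] = spot_capacity
        # Assumed policy: fall back to On-Demand if Spot is not fulfilled in time.
        fleet["LaunchSpecifications"] = {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 10,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        }
    return fleet


# Illustrative layout: On-Demand primary and core nodes, Spot task nodes.
instance_fleets = [
    build_instance_fleet("MASTER", ["m5.xlarge"], on_demand_capacity=1),
    build_instance_fleet("CORE", ["m5.xlarge", "m5a.xlarge", "r5.xlarge"],
                         on_demand_capacity=2),
    build_instance_fleet("TASK", ["m5.xlarge", "m5a.xlarge", "c5.xlarge"],
                         spot_capacity=4),
]
```

A list like `instance_fleets` could then be passed as `Instances={"InstanceFleets": instance_fleets, ...}` in a `run_job_flow` call, alongside the other required cluster settings. Keeping the core fleet on On-Demand capacity and confining Spot to task nodes is one common design choice, since losing task nodes does not affect data stored in HDFS.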
