Data processing using Amazon EMR
Amazon EMR is a platform that enables big data processing at scale. It’s a managed service that offers over 20 open source frameworks, including popular data processing engines such as Hadoop, Spark, Hive, Presto, and Trino. It was created specifically to address the kinds of challenges with a self-managed data processing platform that we went through earlier.
EMR is broad enough that entire books have been written describing each and every aspect of it in detail. The purpose of this book, however, is not to explain all of those aspects but to understand when EMR can be used and which use cases it helps to solve. Let’s first get an overview of EMR.
Amazon EMR overview
EMR provides the tools required to process data at scale, managing the underlying software and hardware to deliver a cost-effective, scalable, and easy-to-manage data platform. The best way to get an overview of a service is to understand what it brings to the table, so let’s quickly look at the benefits of using EMR:
- Separation of compute from storage allows EMR clusters to be turned off when there are no workloads to process. This separation is possible because data is stored in Amazon S3, and EMR can read from and write to S3 instead of HDFS. The EMR File System (EMRFS) connector handles S3 transparently, allowing all existing Hadoop ecosystem frameworks to work seamlessly with S3 as the storage layer.
- Built-in fault tolerance allows the EMR cluster to keep operating even if certain nodes go down. And because the data is highly durable and available in S3, it remains safe for reprocessing even in the event of compute issues.
- EMR provides multiple versions of each open source project to choose from, and new community releases are typically available in EMR within 30 days. This allows organizations to leverage the latest features from these projects.
- An EMR cluster can auto-scale depending on the workloads. This not only ensures that jobs get completed within the desired timeframe but also ensures that idle compute is scaled down so that unnecessary cost is not incurred.
- EMR offers four deployment options: EMR on Amazon Elastic Compute Cloud (Amazon EC2), EMR on Amazon Elastic Kubernetes Service (Amazon EKS), EMR Serverless, and EMR on AWS Outposts. Organizations can choose the deployment that best fits specific jobs or applications; more details on the EMR setup types can be found in the AWS documentation portal (https://docs.aws.amazon.com).
- EMR can also leverage spot instances to bring costs down substantially, sometimes providing savings of up to 90% compared to on-demand pricing.
- Performance improvements in the EMR runtime for Apache Spark and other projects make running Spark on EMR faster than open source Spark. AWS Graviton-based instances improve Spark performance even further.
- Integration with EMR Studio makes collaborative work in Jupyter notebooks easier for data engineers and data scientists.
- Provides extensive security options, including authentication, authorization, encryption, infrastructure protection, logging, monitoring, and notifications.
- Finally, all the integration synergies with other AWS services make the analytics ecosystem even more powerful, cost-effective, and easy to leverage.
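To make the compute/storage separation above concrete, the sketch below builds a Spark step definition whose input and output both live in S3 (via EMRFS), so the cluster holds no permanent data and can be terminated once the step completes. All bucket names, the script path, and the cluster ID are hypothetical placeholders; actually submitting the step would require `boto3` and AWS credentials.

```python
# Sketch: a Spark step that reads raw data from S3 and writes results back
# to S3 through EMRFS, so no data needs to survive cluster termination.
# Bucket names and the script path are hypothetical placeholders.
step = {
    "Name": "daily-clickstream-etl",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--deploy-mode", "cluster",
            "s3://my-code-bucket/jobs/clickstream_etl.py",        # job script in S3
            "--input", "s3://my-data-lake/raw/clickstream/",      # read from S3, not HDFS
            "--output", "s3://my-data-lake/curated/clickstream/", # write back to S3
        ],
    },
}

# With AWS credentials configured, the step could be submitted as:
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

Because every path in the step points at S3, a fresh cluster started tomorrow can pick up exactly where this one left off.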
The following diagram shows the high-level architecture of an EMR cluster. It consists of one or more leader nodes that manage the cluster; core nodes that provide compute, memory, and storage; and task nodes that provide only additional compute and memory, making them a perfect fit for spot instances:
Figure 5.1 – EMR cluster architecture
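The node roles in Figure 5.1 map directly onto EMR instance groups. The sketch below keeps the leader and core nodes on on-demand capacity, since they manage the cluster and hold HDFS blocks, while the stateless task nodes run on the spot market. Instance types and counts here are illustrative choices, not recommendations (note that the EMR API still calls the leader role `MASTER`):

```python
# Sketch of EMR instance groups matching the leader/core/task roles.
# Only task nodes use the SPOT market, because losing one costs no data.
instance_groups = [
    {"Name": "leader", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
     "InstanceCount": 1, "Market": "ON_DEMAND"},  # manages the cluster
    {"Name": "core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge",
     "InstanceCount": 3, "Market": "ON_DEMAND"},  # compute + HDFS storage
    {"Name": "task", "InstanceRole": "TASK", "InstanceType": "m5.xlarge",
     "InstanceCount": 6, "Market": "SPOT"},       # compute only, interruption-safe
]

spot_groups = [g["Name"] for g in instance_groups if g["Market"] == "SPOT"]
print(spot_groups)  # ['task']
```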
An EMR cluster also provides the flexibility to assign different types of compute instances depending on the type of workload. The following diagram highlights the types of instances and the scenarios where they are beneficial:
Figure 5.2 – EMR compute flexibility
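The auto-scaling behavior described earlier is configured through an EMR managed scaling policy. The sketch below builds the `ComputeLimits` portion of such a policy; the capacity numbers are purely illustrative, and attaching the policy to a real cluster would use `boto3`’s `put_managed_scaling_policy` call with a real cluster ID.

```python
# Sketch of an EMR managed scaling policy: the cluster grows toward the
# maximum while jobs are queued and shrinks back down when idle.
# All capacity numbers are illustrative only.
managed_scaling_policy = {
    "ComputeLimits": {
        "UnitType": "Instances",
        "MinimumCapacityUnits": 3,           # never scale below the core baseline
        "MaximumCapacityUnits": 20,          # hard ceiling for peak workloads
        "MaximumOnDemandCapacityUnits": 10,  # capacity above this uses spot
        "MaximumCoreCapacityUnits": 5,       # remaining growth comes as task nodes
    }
}

# With AWS credentials configured, the policy would be attached as:
# import boto3
# boto3.client("emr").put_managed_scaling_policy(
#     ClusterId="j-XXXXXXXXXXXXX", ManagedScalingPolicy=managed_scaling_policy)

limits = managed_scaling_policy["ComputeLimits"]
```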
Typically, EMR is used to migrate large self-managed Hadoop clusters or to process extremely large datasets that benefit from the price performance of the EMR platform. The data processing projects in EMR are geared toward data engineers, who can create Spark jobs in EMR and build their ETL pipelines.
Let’s understand the usefulness of EMR by highlighting some use cases from GreatFin.
Use-case scenario 1 – Big data platform migration
Use case for Amazon EMR
Many years ago, GreatFin embarked on a journey to create a centralized big data processing platform by leveraging open source frameworks such as Hadoop, Spark, and Hive. The platform was created on-premises, and the team was able to customize and operationalize this platform so that data engineers could build data pipelines for their respective lines of business (LOBs).
Due to recent exponential growth in data, many visible cracks have started to appear in the home-grown big data platform—scalability issues, reliability issues, multiple outages, constant upgrades and maintenance, performance issues, and, of course, growing costs. All these challenges have led GreatFin to lose focus on business outcomes. The leadership team now wants to remove all barriers from its data processing platform so that everyone spends more time on business outcomes and less time on managing the data processing platform.
This use case clearly highlights the challenges of self-managing a big data processing platform. We will look at how Amazon EMR can alleviate the pain points around this use case.
If you recall from Chapter 2, we discussed different table formats for setting up a transactional data lake in S3. EMR supports these table formats, providing the perfect execution engine for processing data between the layers of the data lake. EMR can also read table metadata from the AWS Glue Data Catalog. This whole setup makes EMR a perfect service for processing and transforming the data that goes into the S3 data lake. The following diagram highlights this aspect of EMR:
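In practice, this wiring is done through EMR configuration classifications: one points Spark’s Hive metastore client at the Glue Data Catalog, and another enables a transactional table format such as Apache Iceberg. The sketch below builds those classifications as they would appear in a cluster’s `Configurations` list; treat it as a sketch, not a complete cluster definition.

```python
# Sketch: EMR configuration classifications that (a) use the AWS Glue Data
# Catalog as Spark's metastore and (b) enable Apache Iceberg on the cluster.
configurations = [
    {
        "Classification": "spark-hive-site",
        "Properties": {
            # Route Spark SQL metadata lookups to the Glue Data Catalog
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore."
                "AWSGlueDataCatalogHiveClientFactory",
        },
    },
    {
        "Classification": "iceberg-defaults",
        "Properties": {"iceberg.enabled": "true"},  # transactional table format
    },
]

classifications = {c["Classification"] for c in configurations}
print(sorted(classifications))  # ['iceberg-defaults', 'spark-hive-site']
```

With these classifications applied at cluster creation, Spark jobs see the same tables and schemas as every other Glue-integrated service.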
Figure 5.3 – EMR projects and their seamless support for transactional data lakes
This use case requires migrating the data processing platform from a self-managed setup to the AWS-managed EMR platform. This allows the data engineers to focus more on business-specific tasks and less on managing and maintaining the underlying infrastructure. By moving to EMR, GreatFin will not only get all the benefits of EMR but will also substantially bring down the total cost of ownership (TCO) of its data processing platform.
The following diagram highlights the flow of how EMR sits in between the data layers in the S3 data lake so that it can seamlessly process and transform data as it passes through the different layers:
Figure 5.4 – EMR as the data processing platform in the S3 data lake
Let’s look at another typical use case where EMR comes in really handy.
Use-case scenario 2 – Collaborative data engineering
Use case for Amazon EMR Studio
Data engineers at GreatFin have been building data pipelines for many years; however, the whole process is siloed per team, and prototyping, testing, and debugging are cumbersome. Overall, it’s not easy to collaborate on new data processing applications, and the process is far from agile.
For data engineers, Jupyter notebooks are central to their line of work, as they allow developers to test code before deploying it. However, managing these notebooks is not trivial. Data engineers, and even data scientists, need to focus less on notebook infrastructure and more on the business logic they need to create. They need an easy way to build applications by collaborating with others on the team, with all the bells and whistles that make it easy to test, debug, and deploy code in production. That’s where EMR Studio comes into the picture.
EMR Studio is a fully managed IDE within the EMR service that allows interactive data analytics. It offers fully managed Jupyter notebooks, integration with Git-based repositories such as GitHub, a simplified UI for debugging code, an easy way to create and delete EMR clusters, integration with workflow orchestration services, and so forth.
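To ground this, the sketch below assembles the kind of request that `boto3`’s `create_studio` call expects when provisioning an EMR Studio. Every ID, role ARN, security group, and bucket name here is a hypothetical placeholder for illustration only.

```python
# Sketch of an EMR Studio creation request. All resource IDs, role ARNs,
# and the S3 location are hypothetical placeholders.
studio_request = {
    "Name": "greatfin-data-eng-studio",
    "AuthMode": "IAM",  # or "SSO" for IAM Identity Center-based sign-in
    "VpcId": "vpc-0123456789abcdef0",
    "SubnetIds": ["subnet-0123456789abcdef0"],
    "ServiceRole": "arn:aws:iam::111122223333:role/EMRStudioServiceRole",
    "WorkspaceSecurityGroupId": "sg-0123456789abcdef0",
    "EngineSecurityGroupId": "sg-0fedcba9876543210",
    "DefaultS3Location": "s3://greatfin-emr-studio/workspaces/",  # notebook storage
}

# With AWS credentials configured, the Studio would be created as:
# import boto3
# response = boto3.client("emr").create_studio(**studio_request)
```

Once the Studio exists, each engineer gets a workspace backed by the S3 location above, which is what makes the notebooks durable and shareable across the team.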
The following diagram shows a typical workflow that makes it easy for data engineers to take their work from the prototype phase all the way into production with the least amount of time and effort:
Figure 5.5 – Typical data engineering workflow
EMR offers over 20 open source projects that cater to different aspects of data; we will not get into use cases for each of them in this book. However, we will come back to EMR in future chapters to discuss frameworks that come in handy for those particular use cases. For now, let’s move on to our beloved service that has come up in most chapters so far: AWS Glue for data processing.