By now, you should have a good overview of Amazon EMR and its architecture, which will help you visualize how to run your Hadoop workloads on Amazon EMR.
But when you build an enterprise architecture for a data analytics pipeline, be it batch or real-time streaming, running in AWS offers many additional benefits. You can decouple your architecture into multiple components and integrate various other AWS services to build a fault-tolerant, scalable, and highly secure architecture.
Figure 1.5 – Representing EMR integration with other AWS services
The preceding figure is a high-level diagram that shows how you can integrate a few other AWS services with Amazon EMR for an analytics pipeline. Only a handful of services are shown here to give you an idea; there are many other AWS services you can integrate as you see fit for your use case.
Now let's get an overview of these services and understand how they integrate with Amazon EMR.
Amazon S3 with EMR File System (EMRFS)
Of all the AWS services, Amazon S3 takes the top spot, as any data analytics architecture built on AWS will use S3 as a persistent or intermediate data store. When you build a data processing pipeline with Amazon EMR, S3 integration is natively supported through the EMR File System (EMRFS). When a job reads from or writes to an Amazon S3 path, it accesses S3 using the s3:// prefix.
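For example, the following is a minimal PySpark sketch of how a job might read from and write back to S3 through EMRFS; the bucket name, prefixes, and the event_date column are placeholders for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emrfs-demo").getOrCreate()

# Read raw data directly from S3 through EMRFS using the s3:// scheme.
df = spark.read.json("s3://my-analytics-bucket/raw/events/")

# A simple aggregation standing in for real pipeline logic.
daily = df.groupBy("event_date").count()

# Write the results back to S3, again through the s3:// scheme.
daily.write.mode("overwrite").parquet("s3://my-analytics-bucket/curated/daily_counts/")
```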
Amazon Kinesis Data Streams (KDS)
Amazon Kinesis Data Streams (KDS) is a commonly used messaging service within AWS for building real-time streaming pipelines for use cases such as website clickstreams, application log streams, and Internet of Things (IoT) device event streams. It is scalable and durable, continuously capturing gigabytes of data per second from multiple ingesting sources while multiple consumers read from it in parallel.
It provides the Kinesis Producer Library (KPL), which data producers can integrate to push data to Kinesis, and the Kinesis Client Library (KCL), which data-consuming applications can integrate to access the data.
When we build a real-time streaming pipeline with EMR and KDS as the source, we can use Spark Structured Streaming with a Kinesis connector to access the stream datasets.
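As a sketch, the following shows how a Structured Streaming job could consume a Kinesis stream, assuming the open source spark-sql-kinesis connector jar is available on the cluster; the stream name and endpoint are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kds-streaming-demo").getOrCreate()

# The "kinesis" source is provided by the spark-sql-kinesis connector.
events = (spark.readStream
    .format("kinesis")
    .option("streamName", "clickstream-events")
    .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
    .option("startingposition", "TRIM_HORIZON")
    .load())

# Kinesis record payloads arrive as binary; cast them to strings for processing.
query = (events
    .select(col("data").cast("string").alias("payload"))
    .writeStream
    .format("console")
    .start())

query.awaitTermination()
```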
Amazon Managed Streaming for Apache Kafka (MSK)
Similar to KDS, Apache Kafka is a popular open source messaging platform capable of handling massive volumes of data for real-time streaming. However, it comes with the additional overhead of managing the infrastructure yourself.
Amazon Managed Streaming for Apache Kafka (MSK) is a fully managed service built on top of open source Apache Kafka that automates Kafka cluster creation and maintenance. You can set up a Kafka cluster with a few clicks and use it as the event message source when you implement a real-time streaming use case with EMR and Spark Streaming as the processing framework.
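The following is a minimal Structured Streaming sketch that subscribes to a Kafka topic on an MSK cluster using Spark's built-in Kafka source; the broker address and topic name are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("msk-streaming-demo").getOrCreate()

# Subscribe to a topic via the MSK cluster's bootstrap brokers.
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092")
    .option("subscribe", "clickstream-events")
    .option("startingOffsets", "earliest")
    .load())

# Kafka values arrive as binary; cast them to strings for processing.
query = (events
    .select(col("value").cast("string").alias("payload"))
    .writeStream
    .format("console")
    .start())

query.awaitTermination()
```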
AWS Glue Data Catalog
AWS Glue is a fully managed ETL service that is built on top of Apache Spark with additional functionality, such as Glue crawlers and the Glue Data Catalog. Glue crawlers autodetect the schema of source datasets and create virtual tables in the Glue Data Catalog.
With EMR 5.8.0 or later, you can configure Spark SQL in EMR to use the AWS Glue Data Catalog as its external metastore. This is ideal when you have transient cluster scenarios that need an external persistent metastore, or when multiple clusters need to share a common catalog.
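Enabling this is a matter of cluster configuration. The following boto3 sketch shows the relevant spark-hive-site classification when launching a cluster; the cluster name, sizing, and IAM role names are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-glue-catalog-demo",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    # Point Spark SQL's Hive client at the AWS Glue Data Catalog.
    Configurations=[{
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    }],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```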
Amazon Relational Database Service (RDS)
Similar to the Glue Data Catalog, you can also use Amazon Relational Database Service (RDS) as the external metastore for Hive, which can be shared between multiple clusters as a persistent metastore.
Apart from serving as an external metastore, in a few use cases, Amazon RDS is also used as an operational data store for reporting, where EMR big data processing pushes aggregated output to RDS for real-time reporting.
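To point Hive at an RDS-hosted metastore, you override the hive-site classification when creating the cluster. The following dictionary, passed in the Configurations list as in the earlier Glue example, is a sketch in which the RDS endpoint, database name, and credentials are placeholders:

```python
# hive-site classification directing Hive's metastore connection to a
# MySQL-compatible RDS instance; all connection details are placeholders.
hive_site = {
    "Classification": "hive-site",
    "Properties": {
        "javax.jdo.option.ConnectionURL":
            "jdbc:mysql://my-metastore.abc123.us-east-1.rds.amazonaws.com:3306/hive"
            "?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": "hive",
        "javax.jdo.option.ConnectionPassword": "hive-metastore-password",
    },
}
```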
Amazon DynamoDB
Amazon DynamoDB is an AWS-hosted, fully managed, scalable NoSQL database that delivers quick, predictable performance. As it's serverless, it takes away the infrastructure management overhead and provides built-in security features, including encryption at rest.
In a few analytical use cases, DynamoDB is used to store data ingestion or extraction-related checkpoint information, and you can use DynamoDB APIs with Spark to query that information or define Hive external tables with a DynamoDB connector to query it.
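As an illustration of the checkpoint pattern, the following boto3 sketch writes and reads a high-water mark; the table name, key, and attribute names are hypothetical:

```python
import boto3

# Hypothetical checkpoint table keyed by pipeline name.
table = boto3.resource("dynamodb", region_name="us-east-1").Table("etl_checkpoints")

# Record the high-water mark after a successful extraction run.
table.put_item(Item={
    "pipeline": "orders_ingest",
    "last_processed_ts": "2023-01-15T00:00:00Z",
})

# On the next run, read the checkpoint to resume where the job left off.
item = table.get_item(Key={"pipeline": "orders_ingest"}).get("Item", {})
print(item.get("last_processed_ts"))
```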
Amazon Redshift
Amazon Redshift is AWS's massively parallel processing (MPP) data warehousing service, with which you can query and process exabytes of structured or semi-structured data. In the data analytics world, having a data warehouse or data mart is very common, and Redshift can be used for both.
In data analytics use cases, it's a common pattern that after your ETL pipeline processing is done, the aggregated output gets stored in a data warehouse or data mart, and that is where the EMR-to-Redshift connection comes into the picture. Once EMR writes output to Redshift, you can integrate business intelligence reporting tools on top of it.
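The following sketch writes aggregated output from Spark to Redshift, assuming the open source spark-redshift connector (bundled with recent EMR releases) is available; the cluster endpoint, table, IAM role, and S3 staging path are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emr-to-redshift-demo").getOrCreate()

# Aggregated output produced earlier in the pipeline.
daily = spark.read.parquet("s3://my-analytics-bucket/curated/daily_counts/")

# The connector stages data in S3 and loads it into Redshift with COPY.
(daily.write
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "reporting.daily_counts")
    .option("tempdir", "s3://my-analytics-bucket/redshift-temp/")
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-copy-role")
    .mode("append")
    .save())
```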
AWS Lake Formation
AWS Lake Formation is a service that enables granular permission management on your data lake in AWS. When you define AWS Glue Data Catalog tables on top of a data lake, you can use AWS Lake Formation to define access permissions on databases, tables, and columns, which can be shared within the same or with other AWS accounts. This enables centralized data governance, managing permissions for AWS accounts across an organization.
In EMR, when you pull data from Glue Data Catalog tables while using the catalog as an external metastore, cluster processes such as Spark go through Lake Formation permissions to access the data.
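For instance, a column-level grant can be issued through the Lake Formation API, as in the following boto3 sketch; the account ID, role, database, table, and column names are placeholders:

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Grant SELECT on two specific columns of a catalog table to an analyst role.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date"],
        }
    },
    Permissions=["SELECT"],
)
```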
AWS Identity and Access Management (IAM)
AWS Identity and Access Management (IAM) enables you to control authentication and authorization for accessing AWS services through the console or AWS APIs. You can create groups, users, or roles and define policies that grant or restrict access to specific resources or APIs.
While creating an EMR cluster or accessing its API resources, every request is validated against IAM policies.
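As a small illustration, the following boto3 sketch creates a policy that allows only read-only EMR describe and list calls; the policy name and scope are hypothetical:

```python
import json
import boto3

iam = boto3.client("iam")

# A minimal policy allowing read-only Amazon EMR API calls.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["elasticmapreduce:Describe*", "elasticmapreduce:List*"],
        "Resource": "*",
    }],
}

iam.create_policy(
    PolicyName="EmrReadOnlyDemo",
    PolicyDocument=json.dumps(policy_document),
)
```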
AWS Key Management Service (KMS)
When you think of securing your data while it's being transferred over the network or stored in a storage layer, you think of cryptographic keys and an encryption and decryption mechanism. To implement this, you need to store your keys in a secure place that integrates well with your application, and AWS Key Management Service (KMS) makes that simple for you. AWS KMS is a highly secure and resilient service that protects your keys with hardware security modules.
Your EMR cluster can interact with AWS KMS to get the keys for encrypting or decrypting data while it's being stored or transferred between cluster nodes.
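In practice, you wire a KMS key into an EMR security configuration that the cluster is launched with. The following boto3 sketch enables at-rest encryption with a KMS key; the configuration name and key ARN are placeholders:

```python
import json
import boto3

emr = boto3.client("emr", region_name="us-east-1")

kms_key_arn = "arn:aws:kms:us-east-1:123456789012:key/placeholder-key-id"

# Security configuration enabling KMS-backed at-rest encryption for S3 (via
# EMRFS) and for local disks on the cluster nodes.
emr.create_security_configuration(
    Name="kms-at-rest-demo",
    SecurityConfiguration=json.dumps({
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": kms_key_arn,
                },
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                },
            },
        }
    }),
)
```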
Lake House architecture overview
Lake House is a new architecture pattern that addresses the shortcomings of data lakes by combining the best of data lakes and data warehousing. It acknowledges that a one-size-fits-all strategy for analytics eventually leads to compromises. It is not just about connecting a data lake to a data warehouse or making data lake access more structured; it is about connecting a data lake, a data warehouse, and other purpose-built data stores to enable unified data management and governance.
In AWS, you can use Amazon S3 as the data lake, Amazon EMR or AWS Glue for ETL transformations, and Redshift for data warehousing. Then, you can integrate other relational or NoSQL data stores on top of it to solve different big data or machine learning use cases.
The following diagram is a high-level representation of how you can integrate the Lake House architecture in AWS:
Figure 1.6 – Lake House architecture reference
As you can see in the preceding diagram, the Amazon S3 data lake sits at the center, supported by AWS Glue for serverless ETL and AWS Lake Formation for granular permission management.
Around the centralized data lake, we have the following:
- Amazon EMR for batch or streaming big data processing
- Amazon OpenSearch Service for log analytics or search use cases
- Amazon Redshift for data warehousing or data mart use cases
- Amazon DynamoDB as a key-value NoSQL store
- Amazon Aurora for operational reporting or as an external metastore
- Amazon SageMaker for machine learning model training and inference
As explained previously, the Lake House architecture represents how you can bring together the best of multiple services to build an ecosystem that addresses your organization's analytics needs.