EMR architecture deep dive
The following is a high-level view of the Amazon EMR architecture. It includes the distributed storage layer, cluster resource management with Yet Another Resource Negotiator (YARN), batch and stream processing frameworks, and various Hadoop applications.
Apart from these major components, the architecture also includes monitoring with Ganglia, the Hue user interface, Zeppelin notebooks, the Livy server, and connectors that enable integration with other AWS services:
Now let's discuss each of these components in detail.
Distributed storage layer
In a typical on-premises Hadoop cluster, or in a Hadoop-on-EC2 architecture, the disk space of each cluster node contributes to the Hadoop Distributed File System (HDFS) storage space, so storage and compute are tightly coupled.
But...