Amazon EMR
Amazon EMR is a platform for leveraging many big data tools for data processing. We will start by looking at the concepts of MapReduce and Hadoop.
MapReduce and Hadoop
MapReduce and Hadoop are two related concepts in the field of distributed computing and big data processing.
The idea of MapReduce is “divide and conquer”: decompose a big dataset into smaller ones to be processed in parallel on distributed computers. It was originally developed by Google for its search engine to handle the massive amounts of data generated by web crawling. The MapReduce programming model involves two functions: a map function that divides and processes in parallel the datasets and a map function that aggregates the map outputs.
Hadoop is an open source software framework that implements the MapReduce model. Hadoop consists of two core components: Hadoop Distributed File System (HDFS) and MapReduce. HDFS is a distributed filesystem that can rapidly transfer data between...