In Chapter 4, Predicting User Behavior with Tree-Based Methods, we introduced EMR, an AWS service that allows us to run and scale Apache Spark, Hadoop, HBase, Presto, Hive, and other big data frameworks. These frameworks typically require a cluster of machines, each running correctly configured software so that the machines can communicate with each other. Let's look at the most commonly used products within EMR.
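Before turning to the individual products, it may help to see what requesting such a cluster can look like in code. The following is a minimal sketch, not a definitive setup: it assumes the boto3 library, and the cluster name, release label, instance types, region, and default IAM role names are illustrative assumptions that would need to match your own account. The helper names build_cluster_request and launch_cluster are hypothetical.

```python
def build_cluster_request(name, applications, worker_count=2):
    """Assemble keyword arguments for boto3's EMR run_job_flow call.

    All concrete values below are illustrative assumptions, not recommendations.
    """
    return {
        "Name": name,
        # Assumed release label; pick one that is available in your region.
        "ReleaseLabel": "emr-6.15.0",
        # The frameworks EMR should install and configure on every node.
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": worker_count,
                },
            ],
            # Keep the cluster running after any submitted steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default role names; these must already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request to EMR and return the new cluster's ID.

    Requires boto3 and valid AWS credentials; this creates billable resources.
    """
    import boto3  # imported here so the request can be built without boto3 installed

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**request)
    return response["JobFlowId"]


# Build (but do not submit) a request for a small Spark/Hadoop/Hive cluster.
request = build_cluster_request("example-cluster", ["Spark", "Hadoop", "Hive"])
```

Calling launch_cluster(request) would submit the request. Building the request separately from submitting it makes it possible to inspect the configuration before any machines are started.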
Introduction to the EMR architecture
Apache Hadoop
Many applications, such as Spark and HBase, require Hadoop. The basic installation of Hadoop comes with two main services:
- Hadoop Distributed File System (HDFS): This is a service that allows us to store large...