Preface
As the use of internet-related services, computers, and smart products has increased, the amount of data they produce has grown exponentially. This data is extremely valuable for addressing business problems: analyzing it yields insights that support faster decision making and help forecast business growth.
These datasets are too large and complex for traditional data processing technologies to handle efficiently, which is why distributed processing frameworks such as Hadoop and Spark evolved. Amazon Elastic MapReduce (EMR) provides a managed offering for Hadoop ecosystem services, so that businesses can focus on building analytics pipelines instead of spending time managing infrastructure. This makes Amazon EMR a leading choice for Hadoop, Spark, and other big data workloads.
As the amount of data continues to grow, big data analytics is becoming a skill that many people will need in order to succeed in their career or business. Before EMR, trying out Hadoop or Spark workloads was expensive because they require clusters of servers to set up. With Amazon EMR's pay-as-you-go model, you can spin up small clusters quickly, scale them as needed, and terminate them when the job finishes.
Organizations that want to get started with Amazon EMR or plan to migrate existing Hadoop workloads to it, as well as recent college graduates who want to upskill in EMR, will find this book useful for diving deep into EMR's features and architecture patterns.
While writing this book, I kept in mind that it should be useful both to beginners and to technologists who want to learn advanced EMR concepts. I also expect you to have some basic knowledge of AWS and Hadoop, which will make the material easier to follow and the advanced concepts easier to explore.
By the end of this book, you will be able to comfortably architect and implement Hadoop- and Spark-based solutions with transient (job-based) or persistent (multi-tenant, long-running) EMR clusters. You will also understand how a complete end-to-end data analytics solution can be implemented with Amazon EMR for batch, real-time streaming, or interactive workloads. Finally, you will learn migration approaches, best practices, and cost optimization techniques that you can apply when implementing big data analytics workloads with EMR.