Many data scientists and machine learning (ML) practitioners run into problems of scale when they attempt to run ML data pipelines over big data. In this chapter, we will focus primarily on Amazon Elastic MapReduce (EMR), a managed cluster service well suited to running very large ML jobs. EMR can be configured in many ways, and not every setup works for every scenario, so we will outline the main EMR configurations and explain which objectives each one suits. Additionally, we will present AWS Glue as a tool to catalog the results of our big data pipelines.
In this chapter, we will cover the following topics:
- Introduction to the EMR architecture
- Tuning EMR for different applications
- Managing data pipelines with Glue