Data Preparation in the Cloud
In this chapter, we will learn how to set up data preparation in the cloud using various AWS services. Because extract, transform, and load (ETL) operations are central to data preparation, we will take a deeper look at setting up and scheduling ETL jobs in a cost-efficient manner. We will cover four setups: ETL running on a single-node EC2 instance, ETL running on an EMR cluster, and ETL jobs built with Glue and with SageMaker. This chapter will also introduce Apache Spark, one of the most widely used frameworks for ETL. By the end of this chapter, you will understand the advantages of each setup and be able to select the right set of tools for your project.
In this chapter, we will cover the following main topics:
- Data processing in the cloud
- Introduction to Apache Spark
- Setting up a single-node EC2 instance for ETL
- Setting up an EMR cluster for ETL
- Creating a Glue job...