Creating a Glue job for ETL
AWS Glue (https://aws.amazon.com/glue) supports serverless data processing. AWS manages the computational resources behind Glue, so it requires less maintenance effort than a dedicated cluster (for example, EMR). Beyond the reduced maintenance, Glue provides additional features such as a built-in scheduler and the Glue Data Catalog, which will be discussed later.
First, let’s learn how to set up data processing jobs using Glue. Before you start defining the data processing logic, you must create a Glue Data Catalog entry that contains the schema for the data in S3. Once the Glue Data Catalog has been defined for the input data, you can use the Glue Python editor to define the details of the data processing logic (Figure 5.8). The editor provides a basic setup for your application, which makes it easier to set up a Glue job: https://docs.aws.amazon.com/glue/latest/dg/edit-script.html...
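To make the two steps concrete, here is a minimal sketch in Python, not the exact script the editor generates. It assumes CSV input and uses hypothetical names throughout (raw_events, example-bucket, the column list, and the helper functions build_table_input, register_table, and run_job are all illustrative). The first helper only builds a schema definition in the shape that the Glue CreateTable API expects; the other two need AWS access or the Glue runtime, so their imports are deferred until they are called.

```python
def build_table_input(table_name, s3_location, columns):
    """Return a dict shaped like the TableInput argument of Glue's CreateTable API.

    columns is a list of (name, hive_type) pairs describing CSV data in S3.
    """
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": hive_type} for name, hive_type in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


def register_table(database_name, table_input):
    """Create the table in the Glue Data Catalog (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without AWS access

    glue = boto3.client("glue")
    glue.create_table(DatabaseName=database_name, TableInput=table_input)


def run_job(argv, database_name, table_name, output_path):
    """Job body similar in shape to the basic setup the Glue editor provides.

    This only runs inside a Glue job, where awsglue and pyspark are available.
    """
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the input through the catalog table, so the schema comes from the catalog.
    frame = glue_context.create_dynamic_frame.from_catalog(
        database=database_name, table_name=table_name
    )

    # The data processing logic would go here; this sketch writes the data back unchanged.
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet",
    )
    job.commit()


# Hypothetical schema for CSV files stored under an example S3 prefix.
table_input = build_table_input(
    "raw_events",
    "s3://example-bucket/raw/events/",
    [("user_id", "string"), ("event_time", "timestamp"), ("amount", "double")],
)
```

With credentials in place, register_table("example_db", table_input) would add the table to an existing catalog database, and the job script itself would end with a call such as run_job(sys.argv, "example_db", "raw_events", "s3://example-bucket/processed/events/"). Reading through the catalog rather than from a raw S3 path is the reason the schema has to exist before the job logic is written.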