Data ingestion from file/object stores
Ingesting data that already resides in file storage or a cloud-based object store is one of the most common use cases for AWS Glue ETL. The methods and libraries used to access the data store depend on the type of job being executed.
Many file/object storage services and file transfer protocols are available today, including Amazon S3, HDFS, Azure Storage, Google Cloud Storage, IBM Cloud Object Storage, FTP, SFTP, and HTTP(S). In this section, we will focus on two of the most popular file/object stores used with AWS Glue: Amazon S3 and HDFS.
Data ingestion from Amazon S3
Data ingestion from Amazon S3 is by far the most commonly used design pattern for ETL in AWS Glue. Most organizations already have a mechanism for moving data to Amazon S3, typically the AWS CLI or SDKs used directly, AWS Transfer Family (https://aws.amazon.com/aws-transfer-family/), or third-party tools.
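To make the SDK route concrete, the following is a minimal sketch of landing a local file in Amazon S3 with the AWS SDK for Python (boto3) so that a Glue job can read it later. The helper names (build_s3_key, upload_for_ingestion), the bucket name, and the raw/sales prefix are hypothetical and chosen only for illustration; the one SDK call relied on is the S3 client's upload_file method.

```python
import posixpath
from pathlib import Path


def build_s3_key(prefix: str, local_path: str) -> str:
    """Join a key prefix and the local file name into an S3 object key.

    Hypothetical helper: the prefix layout is an assumption, not a Glue requirement.
    """
    # S3 keys always use forward slashes, so posixpath is used instead of os.path.
    return posixpath.join(prefix.strip("/"), Path(local_path).name)


def upload_for_ingestion(s3_client, local_path: str, bucket: str, prefix: str) -> str:
    """Upload one local file and return its s3:// URI.

    s3_client is expected to be a boto3 S3 client, i.e. boto3.client("s3").
    """
    key = build_s3_key(prefix, local_path)
    # upload_file(Filename, Bucket, Key) is the boto3 managed-transfer upload call.
    s3_client.upload_file(str(local_path), bucket, key)
    return f"s3://{bucket}/{key}"


# Example usage (assumes boto3 is installed, AWS credentials are configured,
# and the hypothetical bucket "my-ingest-bucket" exists):
#
#   import boto3
#   uri = upload_for_ingestion(
#       boto3.client("s3"), "/tmp/orders.csv", "my-ingest-bucket", "raw/sales"
#   )
```

Passing the client in as an argument, rather than creating it inside the function, keeps the helper easy to exercise with a stub client and lets the caller decide on region and credentials.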
If we are using...