Chapter 3: Data Ingestion
In the previous chapter, we discussed the fundamental concepts and inner workings of the various features/microservices that are available in AWS Glue, such as the Glue Data Catalog, connections, crawlers, classifiers, the schema registry, Glue ETL jobs, development endpoints, interactive sessions, and triggers. We also explored how AWS Glue crawlers aid in data discovery by crawling different types of data stores, such as Amazon S3, JDBC (Amazon RDS or on-premises databases), and DynamoDB/MongoDB/DocumentDB, to infer the schema and populate the AWS Glue Data Catalog. While discussing Glue ETL in the previous chapter, we introduced a few of the important extensions/features of Spark ETL, including GlueContext, DynamicFrame, JobBookmark, and GlueParquet. In this chapter, we will see them in action by looking at some examples.
In this chapter, we will be discussing some of the components of AWS Glue mentioned in the previous paragraph – specifically Glue...