Data discovery with AWS Glue
One feature that sets AWS Glue apart from other ETL tools is its ability to create a centralized data catalog. This catalog is crucial for data discovery and relies on two components of Glue:
- Glue Data Catalog
- Glue Data Crawler
AWS Glue Data Catalog
A data catalog is a centralized store of metadata for data held in different data stores, such as data lakes, data warehouses, relational databases, and non-relational databases. The metadata describes columns, data formats, locations, and serialization/deserialization mechanisms. Hive Metastore is one of the most popular metadata products used in the industry. However, it keeps its metadata in a relational database management system (RDBMS) such as MySQL or PostgreSQL. The problem with using an RDBMS for Hive metadata is managing and maintaining it, especially for production workloads where high availability, scaling, and redundancy must be taken...
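To make the kinds of metadata listed above concrete, the following is a minimal sketch of a single catalog entry, written as a plain Python dictionary. The field names follow the TableInput structure of the Glue CreateTable API; the table name, columns, and S3 path are hypothetical values chosen only for illustration, and the describe_table helper is likewise an assumption of this sketch, not part of any AWS library.

```python
# Illustrative catalog entry shaped like the Glue TableInput structure.
# The table name, column names, and S3 location are made-up examples.
table_input = {
    "Name": "sales_orders",
    "TableType": "EXTERNAL_TABLE",
    "StorageDescriptor": {
        # Column metadata: name and Hive-style type of each field
        "Columns": [
            {"Name": "order_id", "Type": "bigint"},
            {"Name": "customer", "Type": "string"},
            {"Name": "amount", "Type": "double"},
        ],
        # Where the underlying data lives (hypothetical bucket)
        "Location": "s3://example-bucket/sales/orders/",
        # Data format, expressed as Hadoop input/output format classes
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        # Serialization/deserialization (SerDe) mechanism for reading rows
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            "Parameters": {"field.delim": ","},
        },
    },
}


def describe_table(table):
    """Hypothetical helper: pull out the four kinds of metadata named in the text."""
    sd = table["StorageDescriptor"]
    return {
        "columns": [(c["Name"], c["Type"]) for c in sd["Columns"]],
        "location": sd["Location"],
        "format": sd["InputFormat"],
        "serde": sd["SerdeInfo"]["SerializationLibrary"],
    }


summary = describe_table(table_input)
for key, value in summary.items():
    print(f"{key}: {value}")
```

A dictionary of this shape is what one would typically pass as the TableInput argument to the create_table call of the boto3 Glue client, together with a DatabaseName; the sketch stops short of that call so that it runs without AWS credentials.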