Apache Hudi overview
Apache Hudi is an open source framework that provides record-level transaction support on top of data lakes. It works with open file formats such as Parquet and stores additional metadata to support its operations.
Apache Hudi provides several capabilities. The most popular are the following:
- UPSERT on top of data lakes
- Support for transactions and rollbacks
- Integration with popular distributed processing engines such as Spark, Hive, Presto, and Trino
- Automatic file compaction in data lakes
- The option to query recent update views or past transaction snapshots
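To make the UPSERT capability more concrete, the following is a minimal sketch of the options a Spark job might pass to the Hudi datasource when upserting records. The helper `hudi_upsert_options`, the table name `orders`, the field names, and the S3 path are hypothetical and exist only for illustration. The `hoodie.*` keys are Hudi's Spark datasource write configurations as commonly documented, but you should verify them against the Hudi version you run.

```python
def hudi_upsert_options(table_name, record_key, precombine_field,
                        table_type="COPY_ON_WRITE"):
    """Build a dict of Spark datasource options for a Hudi upsert.

    Illustrative helper, not part of Hudi itself.
    """
    if table_type not in ("COPY_ON_WRITE", "MERGE_ON_READ"):
        raise ValueError(f"unsupported table type: {table_type}")
    return {
        "hoodie.table.name": table_name,
        # Update records whose key already exists, insert the rest.
        "hoodie.datasource.write.operation": "upsert",
        # Field that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to pick the latest version when keys collide.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.table.type": table_type,
    }


# Hypothetical table and field names.
options = hudi_upsert_options("orders", "order_id", "updated_at")

# Assuming a SparkSession with the Hudi bundle on its classpath and a
# DataFrame `df` of changed records, the write would look roughly like:
# df.write.format("hudi").options(**options).mode("append").save(
#     "s3://example-bucket/orders/")
```

Keeping the options in a plain dictionary makes it easy to review the write settings in one place before handing them to Spark.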
Hudi supports both read-heavy and write-heavy workloads. When you write data to an Amazon S3 data lake using the Hudi APIs, you can specify one of the following storage types:
- Copy on Write (CoW): This is the default storage type, which creates a new version of the file and stores the output in Parquet format...