
Why did Uber create Hudi, an open source incremental processing framework on Apache Hadoop?

  • 3 min read
  • 19 Oct 2018


In the process of rebuilding its Big Data platform, Uber created an open-source Spark library named Hadoop Upserts anD Incremental (Hudi). The library lets users update, insert, and delete existing Parquet data in Hadoop. It also lets data users incrementally pull only the changed data, which significantly improves query efficiency. Hudi is horizontally scalable, can be used from any Spark job, and relies only on HDFS to operate.

Why was Hudi introduced?


Uber studied its current data content, data access patterns, and user-specific requirements to identify problem areas. This research revealed the following four limitations:

Scalability limitation in HDFS


Many companies that use HDFS to scale their Big Data infrastructure face this issue. Storing large numbers of small files can degrade performance significantly because HDFS is bottlenecked by its NameNode capacity. This becomes a major issue when the data size grows above 50-100 petabytes.
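To see why small files hurt, here is a rough back-of-the-envelope sketch in Python. It is not taken from Uber's announcement: the figure of about 150 bytes of NameNode memory per file or block object is a commonly quoted rule of thumb that is simply assumed here, and the 128 MiB block size and the function name `namenode_bytes` are illustrative assumptions as well.

```python
# Toy estimate of NameNode memory consumed by file metadata.
# ASSUMPTION: ~150 bytes per namespace object (file or block) is a commonly
# quoted rule of thumb, not a number from the article.
BYTES_PER_OBJECT = 150
MIB = 1024 ** 2


def namenode_bytes(total_bytes: int, file_size: int, block_size: int = 128 * MIB) -> int:
    """Estimate metadata bytes needed to store total_bytes as files of file_size."""
    num_files = -(-total_bytes // file_size)        # ceiling division
    blocks_per_file = -(-file_size // block_size)   # every file needs at least one block
    objects = num_files * (1 + blocks_per_file)     # one file object plus its block objects
    return objects * BYTES_PER_OBJECT


ONE_TIB = 1024 ** 4
small_files = namenode_bytes(ONE_TIB, file_size=1 * MIB)      # 1 MiB files
large_files = namenode_bytes(ONE_TIB, file_size=1024 * MIB)   # 1 GiB files

print(f"1 MiB files: {small_files:,} bytes of metadata")
print(f"1 GiB files: {large_files:,} bytes of metadata")
```

Under these assumptions the same terabyte of data needs a couple of hundred times more NameNode memory when it is stored as 1 MiB files than when it is stored as 1 GiB files, which is why the file count, and not just the data volume, limits how far a single NameNode can scale.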

Need for faster data delivery in Hadoop


Since Uber operates in real time, its services needed access to the latest data. Data delivery had to become much faster, as the 24-hour data latency was far too slow for many of their use cases.

No direct support for updates and deletes for existing data


Uber used snapshot-based ingestion of data, which means a fresh copy of the source data was ingested every 24 hours. As Uber requires the latest data for its business, there was a need for a solution that supports update and delete operations on existing data. However, because their Big Data is stored in HDFS and Parquet, there is no direct support for update operations on existing data.
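The difference between re-ingesting a full snapshot and applying updates in place can be illustrated with a small in-memory sketch. This is a toy model in plain Python, not Hudi's API: the `upsert` function, the `id` key, and the `_deleted` marker are all hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable

Record = Dict[str, object]


def upsert(table: Dict[str, Record], changes: Iterable[Record], key: str = "id") -> Dict[str, Record]:
    """Apply inserts, updates, and deletes to a keyed table in place.

    Toy model only: a real system has to rewrite or merge immutable files,
    whereas this dictionary can be mutated directly.
    """
    for change in changes:
        record_key = change[key]
        if change.get("_deleted"):
            table.pop(record_key, None)  # delete the record if it exists
        else:
            # Insert a new record, or merge the changed fields into the old one.
            table[record_key] = {**table.get(record_key, {}), **change}
    return table


trip_table = {
    "t1": {"id": "t1", "fare": 10.0},
    "t2": {"id": "t2", "fare": 7.5},
}
upsert(trip_table, [
    {"id": "t1", "fare": 12.0},       # update an existing record
    {"id": "t3", "fare": 4.0},        # insert a new record
    {"id": "t2", "_deleted": True},   # delete an existing record
])
print(trip_table)
```

Only the three changed records are processed here. A snapshot-based job would instead reload every record from the source, regardless of how few of them changed.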

Faster ETL and modeling


ETL and modeling jobs were also snapshot-based, requiring their platform to rebuild derived tables in every run. ETL jobs also needed to become incremental to reduce data latency.
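As a minimal sketch of what "incremental" means for a derived table, the following Python example keeps a per-city fare total up to date by processing only the changed rows, and compares the result with a full rebuild. The table layout and the names `rebuild_totals` and `apply_changes` are assumptions made for illustration; they do not come from Uber's platform.

```python
from typing import Dict, Iterable

Trip = Dict[str, object]


def rebuild_totals(trips: Dict[str, Trip]) -> Dict[str, int]:
    """Snapshot style: recompute the derived table from every source row."""
    totals: Dict[str, int] = {}
    for trip in trips.values():
        totals[trip["city"]] = totals.get(trip["city"], 0) + trip["fare_cents"]
    return totals


def apply_changes(trips: Dict[str, Trip], totals: Dict[str, int], changes: Iterable[Trip]) -> None:
    """Incremental style: touch only the rows that changed."""
    for new in changes:
        old = trips.get(new["id"])
        if old is not None:
            totals[old["city"]] -= old["fare_cents"]  # retract the old contribution
        trips[new["id"]] = new
        totals[new["city"]] = totals.get(new["city"], 0) + new["fare_cents"]


source_trips = {
    "t1": {"id": "t1", "city": "SF", "fare_cents": 1000},
    "t2": {"id": "t2", "city": "NYC", "fare_cents": 750},
}
city_totals = rebuild_totals(source_trips)  # initial full build

apply_changes(source_trips, city_totals, [
    {"id": "t1", "city": "SF", "fare_cents": 1200},   # updated trip
    {"id": "t3", "city": "NYC", "fare_cents": 400},   # new trip
])
print(city_totals)
```

The incremental path does work proportional to the number of changed rows, while the rebuild does work proportional to the size of the whole source table. Both arrive at the same derived table.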

How does Hudi solve the aforementioned limitations?


The following diagram shows Uber's Big Data platform after the incorporation of Hudi:

[Diagram: Uber's Big Data platform after the incorporation of Hudi]

Source: Uber


Whether the data updates are new records added to recent date partitions or changes to older data, Hudi lets users pass in their latest checkpoint timestamp and retrieve all the records that have been updated since. This retrieval does not require running an expensive query that scans the entire source table.
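The checkpoint-based pull described above can be modeled with a small commit timeline. The sketch below is a toy in plain Python and does not use or reproduce Hudi's API: `ToyTable`, `commit`, and `pull_since` are hypothetical names, and commit timestamps are assumed to be increasing integers.

```python
import bisect
from typing import Dict, List, Tuple

Record = Dict[str, object]


class ToyTable:
    """Keyed table that remembers which keys each commit wrote."""

    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}    # key -> latest version of the record
        self.commit_times: List[int] = []       # sorted commit timestamps
        self.commit_keys: List[List[str]] = []  # keys written per commit, parallel list

    def commit(self, timestamp: int, changes: List[Record]) -> None:
        if self.commit_times and timestamp <= self.commit_times[-1]:
            raise ValueError("commit timestamps must be strictly increasing")
        for record in changes:
            self.records[record["id"]] = record
        self.commit_times.append(timestamp)
        self.commit_keys.append([record["id"] for record in changes])

    def pull_since(self, checkpoint: int) -> Tuple[List[Record], int]:
        """Return records changed after checkpoint, plus the next checkpoint."""
        # Binary search on the timeline locates the first newer commit,
        # so records written by older commits are never examined.
        start = bisect.bisect_right(self.commit_times, checkpoint)
        changed_keys: List[str] = []
        seen = set()
        for keys in self.commit_keys[start:]:
            for key in keys:
                if key not in seen:
                    seen.add(key)
                    changed_keys.append(key)
        latest = self.commit_times[-1] if self.commit_times else checkpoint
        return [self.records[k] for k in changed_keys], max(checkpoint, latest)


table = ToyTable()
table.commit(100, [{"id": "t1", "fare": 10.0}, {"id": "t2", "fare": 7.5}])
table.commit(200, [{"id": "t1", "fare": 12.0}, {"id": "t3", "fare": 4.0}])

changed, next_checkpoint = table.pull_since(100)
print(changed, next_checkpoint)
```

A consumer stores `next_checkpoint` and passes it to the next call, so each run sees only what changed since the previous one. In this example the pull after checkpoint 100 returns the updated `t1` and the new `t3`, and leaves the untouched `t2` alone.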

Using this library, Uber has moved from snapshot-based ingestion to an incremental ingestion model. As a result, data latency dropped from 24 hours to less than one hour.

To learn more about Hudi, check out Uber’s official announcement.
