
Uber’s Marmaray, an Open Source Data Ingestion and Dispersal Framework for Apache Hadoop

  • 3 min read
  • 14 Sep 2018


Uber released an open source data ingestion and dispersal framework for Apache Hadoop, called “Marmaray”, yesterday. Marmaray is a plug-in based framework built on top of the Hadoop ecosystem by Uber’s Hadoop Platform team. It connects a collection of systems and services in a cohesive manner so that together they can perform the functions described below.

Major Functions

  • Marmaray is capable of producing quality schematized data via Uber’s schema management library and services.
  • It ingests data from multiple data stores into Uber’s Hadoop data lake.
  • It can build pipelines using Uber’s internal workflow orchestration service. These pipelines crunch and process the ingested data, then calculate business metrics from it and store them in Hive.
  • Marmaray serves the processed results from Hive to an online data store. This allows the internal customers to query the data and get close to instant results.


In addition, most of the fundamental building blocks and abstractions in Marmaray’s design were inspired by Gobblin, a similar project developed at LinkedIn.

Marmaray Architecture


Marmaray has several generic components, namely DataConverters, WorkUnitCalculator, Metadata Manager, ISource, and ISink, that facilitate its overall job flow. Let’s discuss these components.

Figure: Marmaray Architecture


DataConverters


All raw data must conform to a schema before it is ingested into Uber’s Hadoop data lake, and this is where DataConverters come into the picture. They filter out any data that is malformed, is missing required fields, or has other issues. DataConverters are also responsible for producing the error records with every transformation.
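The filtering step described above can be illustrated with a small sketch. This is not Marmaray's actual API (Marmaray itself is a Java project); the `SCHEMA` dictionary, the `ErrorRecord` class, and the `convert` function are hypothetical names chosen only to show how a converter might split input into schema-conforming rows and error records.

```python
from dataclasses import dataclass

# Hypothetical schema: field name -> required Python type.
# This is an illustration, not Marmaray's real schema format.
SCHEMA = {"trip_id": str, "fare": float}


@dataclass
class ErrorRecord:
    raw: dict    # the original record that failed conversion
    reason: str  # why it was rejected


def convert(records, schema=SCHEMA):
    """Split raw records into schema-conforming rows and error records."""
    valid, errors = [], []
    for raw in records:
        missing = [name for name in schema if name not in raw]
        if missing:
            errors.append(ErrorRecord(raw, f"missing fields: {missing}"))
            continue
        bad = [name for name, typ in schema.items()
               if not isinstance(raw[name], typ)]
        if bad:
            errors.append(ErrorRecord(raw, f"wrong type: {bad}"))
            continue
        # Keep only the fields the schema knows about.
        valid.append({name: raw[name] for name in schema})
    return valid, errors


valid, errors = convert([
    {"trip_id": "t1", "fare": 12.5},  # conforms to the schema
    {"trip_id": "t2"},                # missing "fare"
    {"trip_id": 3, "fare": 4.0},      # "trip_id" has the wrong type
])
```

Returning the rejected rows alongside the good ones, rather than silently dropping them, mirrors the idea that every transformation also yields error records that can be inspected later.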

WorkUnitCalculator


Uber introduced the concept of a WorkUnitCalculator to measure the amount of data to process. At a high level, the WorkUnitCalculator looks at the type of input source and the previously stored checkpoint, and then calculates the next work unit, or batch of work.

The WorkUnitCalculator also takes throttling information into account when sizing the next batch of data to process.
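As a rough illustration of that calculation, the sketch below assumes an offset-based source (such as a log or queue). The names `WorkUnit`, `next_work_unit`, and `max_records` are hypothetical and are not taken from Marmaray; the throttle is modeled simply as a cap on the number of records per run.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    start_offset: int  # first offset to read (inclusive)
    end_offset: int    # offset at which to stop (exclusive)


def next_work_unit(checkpoint, latest_offset, max_records):
    """Compute the next batch from the stored checkpoint, capped by a throttle.

    checkpoint    -- offset where the previous run stopped
    latest_offset -- newest offset currently available in the source
    max_records   -- throttle: upper bound on records processed per run
    """
    start = checkpoint
    end = min(latest_offset, start + max_records)
    return WorkUnit(start, end)


# The previous run stopped at 1000, the source now holds data up to 5000,
# and the throttle allows at most 1500 records per run.
unit = next_work_unit(checkpoint=1000, latest_offset=5000, max_records=1500)
```

After a successful run, the unit's `end_offset` would be stored as the new checkpoint, so the following run picks up where this one stopped.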

Metadata Manager


The Metadata Manager caches job-level metadata. Its metadata store can hold any metrics that are useful for tracking, describing, or collecting status on jobs.
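A minimal sketch of such a cache is shown below. It assumes, purely for illustration, that the backing store is a dictionary mapping a job name to a JSON string; the class and its `load`, `set`, `get`, and `save` methods are hypothetical and do not reflect Marmaray's real interface.

```python
import json


class MetadataManager:
    """Caches job-level metadata in memory and persists it on demand."""

    def __init__(self, store):
        # Hypothetical backing store: a dict of job name -> JSON string.
        self._store = store
        self._cache = {}

    def load(self, job):
        """Read a job's previously saved metadata into the in-memory cache."""
        self._cache = json.loads(self._store.get(job, "{}"))

    def set(self, key, value):
        self._cache[key] = value

    def get(self, key, default=None):
        return self._cache.get(key, default)

    def save(self, job):
        """Write the cached metadata back to the store."""
        self._store[job] = json.dumps(self._cache)


store = {}
manager = MetadataManager(store)
manager.load("trips_ingest")
manager.set("checkpoint", 2500)
manager.save("trips_ingest")
```

A later run of the same job could call `load("trips_ingest")` and read the saved checkpoint back with `get("checkpoint")`.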

ISource and ISink


The ISource contains all the necessary information from the source data for the requested work units, and the ISink contains all the necessary information on how to write to the sink.

Marmaray’s support for any-source to any-sink data pipelines can be applied to a wide range of use cases both in the Hadoop ecosystem and for data migration.
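The any-source to any-sink idea can be sketched with two small abstractions. The `Source` and `Sink` classes below are hypothetical Python analogues written for this article, not Marmaray's Java ISource and ISink interfaces, and the in-memory list implementations stand in for real systems.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    @abstractmethod
    def read(self, start, end):
        """Return the records for the work unit [start, end)."""


class Sink(ABC):
    @abstractmethod
    def write(self, records):
        """Write records to the destination."""


class ListSource(Source):
    """Toy source backed by an in-memory list."""

    def __init__(self, rows):
        self.rows = rows

    def read(self, start, end):
        return self.rows[start:end]


class ListSink(Sink):
    """Toy sink that collects records in a list."""

    def __init__(self):
        self.rows = []

    def write(self, records):
        self.rows.extend(records)


def run_job(source, sink, start, end):
    """Move one work unit from any Source to any Sink."""
    sink.write(source.read(start, end))


sink = ListSink()
run_job(ListSource(["a", "b", "c", "d"]), sink, start=1, end=3)
```

Because `run_job` depends only on the two abstract interfaces, supporting a new system means adding one new `Source` or `Sink` implementation rather than a new pipeline for every pair of systems.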

“We hope that Marmaray will serve the data needs of other organizations, and that open source developers will broaden its functionalities,” reads the Uber Blog.

For more information, check out the official Uber Blog.

Uber open sources its large scale metrics platform, M3 for Prometheus

Uber introduces Fusion.js, a plugin-based web development framework for high performance apps

Uber’s kepler.gl, an open source toolbox for GeoSpatial Analysis