Yesterday, Uber released Marmaray, an open source data ingestion and dispersal framework for Apache Hadoop. Marmaray is a plug-in based framework designed and built on top of the Hadoop ecosystem by Uber’s Hadoop Platform team. It connects a collection of systems and services in a cohesive manner so that data can be ingested into Hadoop and dispersed from it to other systems.
Most of the fundamental building blocks and abstractions in Marmaray’s design were inspired by Gobblin, a similar project developed at LinkedIn.
Marmaray has several generic components, namely DataConverters, WorkUnitCalculator, Metadata Manager, ISource, and ISink, that facilitate its overall job flow. Let’s discuss these components.
DataConverters convert records and produce error records with every transformation. All raw data must conform to a schema before it is ingested into Uber’s Hadoop data lake, and this is where DataConverters come into the picture. They filter out any data that is malformed, is missing required fields, or has other issues.
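To make the idea of schema validation with error records concrete, here is a minimal Python sketch. It is not Marmaray’s API (Marmaray itself is written in Java): the `SCHEMA` mapping, `ConversionResult`, and `convert` are hypothetical names invented for illustration, and the schema is assumed to be a simple mapping of field name to required type.

```python
from dataclasses import dataclass, field

# Hypothetical schema for illustration: field name -> required Python type.
# This is an assumption, not the schema format Marmaray uses.
SCHEMA = {"trip_id": str, "fare": float}


@dataclass
class ConversionResult:
    valid: list = field(default_factory=list)   # records that conform to the schema
    errors: list = field(default_factory=list)  # error records for rejected input


def convert(records, schema=SCHEMA):
    """Split dict records into schema-conforming records and error records."""
    result = ConversionResult()
    for record in records:
        # A field is a problem if it is missing or has the wrong type.
        bad_fields = [
            name
            for name, expected_type in schema.items()
            if not isinstance(record.get(name), expected_type)
        ]
        if bad_fields:
            result.errors.append({"record": record, "bad_fields": bad_fields})
        else:
            result.valid.append(record)
    return result
```

In this sketch, a record such as {"trip_id": "a2"} would land in the error list with "fare" reported as the bad field, while well-formed records pass through to the valid list.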
Uber introduced the concept of a WorkUnitCalculator to determine the amount of data to process. At a high level, the WorkUnitCalculator looks at the type of input source and the previously stored checkpoint, and then calculates the next work unit, or batch of work.
The WorkUnitCalculator also takes throttling information into account when determining the next batch of data to process.
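The checkpoint-plus-throttle calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source is assumed to expose integer offsets, the throttle is assumed to be a maximum record count per batch, and `WorkUnit` and `next_work_unit` are hypothetical names rather than Marmaray classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkUnit:
    start_offset: int  # first offset to read (inclusive)
    end_offset: int    # offset to stop at (exclusive)


def next_work_unit(checkpoint: int, latest_offset: int, max_records: int) -> Optional[WorkUnit]:
    """Compute the next batch from the stored checkpoint, capped by a throttle.

    checkpoint    -- next unread offset saved by the previous run (assumed)
    latest_offset -- end of the data currently available in the source (assumed)
    max_records   -- throttle: the most records a single batch may contain
    """
    end = min(latest_offset, checkpoint + max_records)
    if end <= checkpoint:
        return None  # nothing new to process
    return WorkUnit(start_offset=checkpoint, end_offset=end)
```

With a checkpoint of 100, 1000 records available, and a throttle of 250, this sketch yields a work unit covering offsets 100 to 350; once the checkpoint reaches the latest offset, it returns None.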
The Metadata Manager caches job-level metadata. Its metadata store can hold any metrics that are useful for tracking, describing, or collecting status on jobs.
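A job-level metadata cache of this kind might look roughly like the following Python sketch. The class name `InMemoryMetadataManager`, its methods, and the choice of JSON for serialization are all assumptions made for illustration, not Marmaray’s actual interface.

```python
import json


class InMemoryMetadataManager:
    """Hypothetical job-level metadata cache backed by a plain dict."""

    def __init__(self, initial=None):
        # Copy any previously persisted metadata into the in-memory cache.
        self._cache = dict(initial or {})

    def set(self, key, value):
        self._cache[key] = value

    def get(self, key, default=None):
        return self._cache.get(key, default)

    def serialize(self):
        # Produce a stable string form that a job could persist when it finishes.
        return json.dumps(self._cache, sort_keys=True)
```

A job could record its latest checkpoint and a count of records written during a run, then persist the serialized form so that the next run can pick up where this one stopped.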
ISource contains all the information needed from the source data to produce the appropriate work units, and ISink contains all the information needed to write to the sink.
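The source and sink abstractions are what make an any-source to any-sink pipeline possible: the job only talks to two interfaces. The Python sketch below illustrates that idea with hypothetical names (`Source`, `Sink`, `ListSource`, `ListSink`, `run_job`) and in-memory lists standing in for real systems; Marmaray’s own ISource and ISink are Java interfaces whose exact methods are not shown in this article.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    """Hypothetical source interface: knows how to read one work unit."""

    @abstractmethod
    def read(self, start, end):
        ...


class Sink(ABC):
    """Hypothetical sink interface: knows how to write a batch of records."""

    @abstractmethod
    def write(self, records):
        ...


class ListSource(Source):
    def __init__(self, rows):
        self._rows = list(rows)

    def read(self, start, end):
        # Return the records in the half-open offset range [start, end).
        return self._rows[start:end]


class ListSink(Sink):
    def __init__(self):
        self.written = []

    def write(self, records):
        self.written.extend(records)


def run_job(source, sink, start, end):
    """Move one work unit from any Source to any Sink; return the record count."""
    records = source.read(start, end)
    sink.write(records)
    return len(records)
```

Because `run_job` depends only on the two abstract classes, swapping the list-backed source or sink for one backed by a different system would not require changing the job logic.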
Marmaray’s support for any-source to any-sink data pipelines can be applied to a wide range of use cases both in the Hadoop ecosystem and for data migration.
“We hope that Marmaray will serve the data needs of other organizations, and that open source developers will broaden its functionalities,” reads the Uber Blog.
For more information, check out the official Uber Blog.