The Apache Hadoop team has released Apache Hadoop 3.2.0, an open source software platform for distributed storage and processing of large data sets. This version is the first in the 3.2 release line and is not yet generally available or production ready.
This release features Node Attributes, which make it possible to tag nodes with multiple labels based on their attributes. Containers can then be placed according to expressions over these labels. Node attributes are not associated with any queue, so there is no need for queue resource planning or authorization for attributes.
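The idea of placing containers by matching an expression against node attributes can be illustrated with a small sketch. This is a simplified model only, not the YARN API: the node names, the attribute keys and the `matching_nodes` helper are all hypothetical, and the "expression" is reduced to a set of key=value pairs that must all hold.

```python
# Illustrative model only: this does not call or mirror the YARN API.
from typing import Dict, List

# Hypothetical nodes, each tagged with several attributes.
NODES: Dict[str, Dict[str, str]] = {
    "node1": {"os": "centos7", "gpu": "true"},
    "node2": {"os": "centos7", "gpu": "false"},
    "node3": {"os": "ubuntu18", "gpu": "true"},
}


def matching_nodes(nodes: Dict[str, Dict[str, str]],
                   required: Dict[str, str]) -> List[str]:
    """Return the sorted names of nodes that satisfy every required key=value pair."""
    return sorted(
        name
        for name, attrs in nodes.items()
        if all(attrs.get(key) == value for key, value in required.items())
    )


# A container that needs a CentOS 7 node with a GPU.
print(matching_nodes(NODES, {"os": "centos7", "gpu": "true"}))
```

Note that nothing in this model refers to a queue: matching depends only on the attributes attached to each node.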
This release comes with Hadoop Submarine, which enables data engineers to develop, train and deploy deep learning models in TensorFlow on the same Hadoop YARN cluster where the data resides. It also allows jobs to access data and models in HDFS (Hadoop Distributed File System) and other storage systems. It supports user-specified Docker images and customized DNS names for roles, such as tensorboard.$user.$domain:6006.
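To make the DNS name pattern concrete, here is a minimal sketch of how a name like tensorboard.$user.$domain:6006 expands once the placeholders are filled in. It uses plain string templating and assumes simple placeholder substitution; how Submarine itself resolves these names is not shown here, and the user and domain values are hypothetical.

```python
from string import Template

# The pattern quoted above, treated as a plain string template.
ROLE_ENDPOINT = Template("tensorboard.$user.$domain:6006")


def role_endpoint(user: str, domain: str) -> str:
    """Fill in the user and domain placeholders of the role endpoint pattern."""
    return ROLE_ENDPOINT.substitute(user=user, domain=domain)


# Hypothetical user and domain values.
print(role_endpoint("alice", "example.com"))  # tensorboard.alice.example.com:6006
```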
The storage policy satisfier lets HDFS applications move blocks between storage types as they set storage policies on files and directories. It is also a solution for decoupling storage capacity from compute capacity.
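As a rough sketch of the workflow, the two steps (set a policy, then ask HDFS to satisfy it) map onto the `hdfs storagepolicies` command line tool. The snippet below only builds and prints the commands rather than running them. The path `/data/archive` is hypothetical, and whether the satisfier is enabled on a given cluster depends on its configuration, so treat this as an outline to check against the HDFS documentation.

```python
import shlex
from typing import List


def storage_policy_commands(path: str, policy: str) -> List[List[str]]:
    """Build the commands that set a storage policy and then request block movement."""
    return [
        # Step 1: record the desired storage policy on the file or directory.
        ["hdfs", "storagepolicies", "-setStoragePolicy",
         "-path", path, "-policy", policy],
        # Step 2: ask the storage policy satisfier to move blocks accordingly.
        ["hdfs", "storagepolicies", "-satisfyStoragePolicy", "-path", path],
    ]


# Hypothetical path; COLD is one of the built-in HDFS storage policies.
for command in storage_policy_commands("/data/archive", "COLD"):
    print(" ".join(shlex.quote(arg) for arg in command))
```

Returning argument lists rather than a single shell string keeps paths with spaces safe if the commands are later passed to `subprocess.run`.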
This release comes with an enhanced S3A connector, including better resilience to throttled AWS S3 and DynamoDB I/O.
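Resilience to throttling generally means retrying a rejected request after a growing delay instead of failing immediately. The sketch below shows that general technique (exponential backoff with random jitter) in isolation. It is not the S3A connector's implementation: `ThrottledError`, `call_with_backoff` and all parameter values are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Stand-in for a throttling response from a remote store."""


def call_with_backoff(operation: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 0.1,
                      max_delay: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, retrying on ThrottledError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the throttling error.
            # Double the delay ceiling on each retry, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.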
The release also supports the latest Azure Data Lake Storage Gen2.
To learn more about this release, check out the release notes on Hadoop’s official website.