Spark + H2O = Sparkling Water for your machine learning needs

[box type="note" align="" class="" width=""]The following is an excerpt from the book Mastering Machine Learning with Spark, Chapter 1, Introduction to Large-Scale Machine Learning and Spark, written by Alex Tellez, Max Pumperla, and Michal Malohlava. This article introduces Sparkling Water, H2O's integration of its platform within the Spark project, which combines the machine learning capabilities of H2O with all the functionality of Spark.[/box]

H2O is an open source machine learning platform that plays extremely well with Spark; in fact, it was one of the first third-party packages deemed "Certified on Spark".

Sparkling Water (H2O + Spark) is H2O's integration of its platform within the Spark project, which combines the machine learning capabilities of H2O with all the functionality of Spark. This means that users can run H2O algorithms on Spark RDDs and DataFrames for both exploration and deployment purposes. This is possible because H2O and Spark share the same JVM, which allows for seamless transitions between the two platforms. H2O stores data in the H2O frame, a compressed, columnar representation of your dataset that can be created from a Spark RDD and/or DataFrame. Throughout much of this book, we will be referencing algorithms from Spark's MLlib library and H2O's platform, showing how to use both libraries to get the best results possible for a given task.
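To make the idea of a compressed, columnar representation more concrete, here is a minimal sketch in plain Python. It does not use H2O at all, and the dictionary-encoding scheme and the names `encode_column` and `decode_column` are assumptions made purely for illustration; the excerpt does not describe how H2O actually compresses its columns. The point is only that storing a dataset column by column lets repeated values be stored once and referenced by small integer codes.

```python
from array import array


def encode_column(values):
    """Dictionary-encode one column: each distinct value is stored once,
    and every row keeps only a one-byte code pointing at it."""
    dictionary = []     # distinct values, in first-seen order
    index = {}          # value -> position in `dictionary`
    codes = array("B")  # one unsigned byte per row (illustrative limit: 256 distinct values)
    for value in values:
        if value not in index:
            if len(dictionary) == 256:
                raise ValueError("too many distinct values for one-byte codes")
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def decode_column(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


# Row-oriented input, as it might arrive from a row-based collection.
rows = [("NY", 1), ("SF", 2), ("NY", 3), ("NY", 4), ("SF", 5)]

# Pull out one column and compress it independently of the others.
city_column = [row[0] for row in rows]
dictionary, codes = encode_column(city_column)
restored = decode_column(dictionary, codes)
```

A column with few distinct values shrinks to a short dictionary plus one byte per row, which is the kind of saving a row-oriented layout cannot easily exploit.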

The following is a summary of the features Sparkling Water comes equipped with:

Use of H2O algorithms within a Spark workflow
Transformations between Spark and H2O data structures
Use of Spark RDD and/or DataFrame as inputs to H2O algorithms
Use of H2O frames as inputs into MLlib algorithms (will come in handy when we do feature engineering later)
Transparent execution of Sparkling Water applications on top of Spark (for example, we can run a Sparkling Water application within a Spark stream)
The H2O user interface to explore Spark data

Design of Sparkling Water

Sparkling Water is designed to be executed as a regular Spark application. Consequently, it is launched inside a Spark executor created after submitting the application. At this point, H2O starts services, including a distributed key-value (K/V) store and memory manager, and orchestrates them into a cloud. The topology of the created cloud follows the topology of the underlying Spark cluster.

As stated previously, Sparkling Water enables transformation between different types of RDDs/DataFrames and H2O frames, in both directions. When converting from an H2O frame to an RDD, a wrapper is created around the H2O frame to provide an RDD-like API. In this case, data is not duplicated but served directly from the underlying H2O frame. Converting from an RDD/DataFrame to an H2O frame requires data duplication because it transforms data from Spark into H2O-specific storage. However, data stored in an H2O frame is heavily compressed and does not need to be preserved as an RDD anymore:

[Figure: data sharing between Spark RDDs/DataFrames and H2O frames]
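The asymmetry between the two conversion directions can be sketched in plain Python. This is a toy model under stated assumptions, not Sparkling Water code: `ColumnarFrame` and `RowView` are hypothetical classes standing in for an H2O frame and for the RDD-like wrapper, and neither mirrors the real API. Building the frame from rows copies every value into column storage, while the view only holds a reference to the frame and reads rows from it on demand.

```python
class ColumnarFrame:
    """Toy stand-in for an H2O frame: one Python list per column."""

    def __init__(self, names, columns):
        self.names = list(names)
        self.columns = columns

    @classmethod
    def from_rows(cls, names, rows):
        # Spark -> H2O direction: every value is copied into column storage.
        columns = [[row[i] for row in rows] for i in range(len(names))]
        return cls(names, columns)

    def num_rows(self):
        return len(self.columns[0]) if self.columns else 0


class RowView:
    """Toy stand-in for the RDD-like wrapper around a frame."""

    def __init__(self, frame):
        self.frame = frame  # a reference only; no data is copied

    def __iter__(self):
        # Rows are assembled on demand from the underlying columns.
        for i in range(self.frame.num_rows()):
            yield tuple(column[i] for column in self.frame.columns)

    def map(self, fn):
        # Lazy, row-at-a-time transformation, loosely analogous to an RDD map.
        return (fn(row) for row in self)


rows = [("NY", 1.0), ("SF", 2.5), ("NY", 4.0)]

# Duplication happens here: the frame owns its own copy of the data.
frame = ColumnarFrame.from_rows(["city", "amount"], rows)

# No duplication here: the view serves rows straight from the frame.
view = RowView(frame)
total = sum(view.map(lambda row: row[1]))
```

Because the view reads through to the frame, a change to the frame's columns is visible the next time the view is iterated, whereas the original `rows` list is unaffected. That mirrors the text: once the data lives in the frame, the source collection no longer needs to be kept around.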