Mastering Machine Learning with Spark 2.x

Harness the potential of machine learning through Spark

Product type: Paperback
Published: Aug 2017
Publisher: Packt
ISBN-13: 9781785283451
Length: 340 pages
Edition: 1st Edition
Authors (3): Alex Tellez, Michal Malohlava, Max Pumperla

Introducing H2O.ai

H2O is an open source machine learning platform that plays extremely well with Spark; in fact, it was one of the first third-party packages deemed "Certified on Spark".

Sparkling Water (H2O + Spark) is H2O's integration of its platform within Spark, combining the machine learning capabilities of H2O with all the functionality of Spark. This means that users can run H2O algorithms on Spark RDDs and DataFrames for both exploration and deployment purposes. This is possible because H2O and Spark share the same JVM, which allows for seamless transitions between the two platforms. H2O stores data in an H2O frame, a columnar-compressed representation of your dataset that can be created from a Spark RDD or DataFrame. Throughout much of this book, we will reference algorithms from both Spark's MLlib library and H2O's platform, showing how to use both libraries to get the best results possible for a given task.
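To make the term "columnar-compressed" more concrete, here is a toy sketch in plain Python. It does not use or mirror the H2O API: the helper names `to_columns`, `dictionary_encode`, and `dictionary_decode` are hypothetical, and dictionary encoding is only an illustrative stand-in for whatever compression H2O applies internally. The idea it shows is that storing values column by column puts similar values next to each other, which tends to make them cheaper to encode.

```python
from typing import Any, Dict, Hashable, List, Tuple


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list per column.

    Assumes every row has the same keys as the first row.
    """
    names = list(rows[0].keys()) if rows else []
    return {name: [row[name] for row in rows] for name in names}


def dictionary_encode(values: List[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Replace repeated values with small integer codes plus a lookup table."""
    levels: List[Hashable] = []      # distinct values, in order of first appearance
    index: Dict[Hashable, int] = {}  # value -> position in `levels`
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(levels)
            levels.append(value)
        codes.append(index[value])
    return levels, codes


def dictionary_decode(levels: List[Hashable], codes: List[int]) -> List[Hashable]:
    """Rebuild the original column from its lookup table and codes."""
    return [levels[code] for code in codes]


# Made-up example data, row-oriented as it might arrive from a row collection.
rows = [
    {"city": "Prague", "temp": 21.5},
    {"city": "Austin", "temp": 33.0},
    {"city": "Prague", "temp": 19.0},
    {"city": "Prague", "temp": 22.5},
]

columns = to_columns(rows)
levels, codes = dictionary_encode(columns["city"])
print(levels)  # ['Prague', 'Austin']
print(codes)   # [0, 1, 0, 0]
```

The repeated city names collapse into two lookup entries and four small integers, and `dictionary_decode(levels, codes)` recovers the original column.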

The following is a summary of the features Sparkling Water comes equipped with:

  • Use of H2O algorithms within a Spark workflow
  • Transformations between Spark and H2O data structures
  • Use of Spark RDD and/or DataFrame as inputs to H2O algorithms
  • Use of H2O frames as inputs into MLlib algorithms (will come in handy when we do feature engineering later)
  • Transparent execution of Sparkling Water applications on top of Spark (for example, we can run a Sparkling Water application within a Spark stream)
  • The H2O user interface to explore Spark data

Design of Sparkling Water

Sparkling Water is designed to be executed as a regular Spark application. Consequently, it is launched inside a Spark executor created after submitting the application. At this point, H2O starts services, including a distributed key-value (K/V) store and memory manager, and orchestrates them into a cloud. The topology of the created cloud follows the topology of the underlying Spark cluster.

As stated previously, Sparkling Water enables transformations from different types of RDDs/DataFrames to H2O frames, and vice versa. When converting from an H2O frame to an RDD, a wrapper is created around the H2O frame to provide an RDD-like API. In this case, data is not duplicated but served directly from the underlying H2O frame. Converting from an RDD/DataFrame to an H2O frame requires data duplication because it transforms data from Spark into H2O-specific storage. However, data stored in an H2O frame is heavily compressed, and the original RDD no longer needs to be preserved:

Data sharing between Sparkling Water and Spark
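The asymmetry between the two conversion directions can be sketched with a small toy model in plain Python. This is not Sparkling Water code: `ToyFrame` and `RowView` are hypothetical classes that only imitate the behavior described above, with a copy when rows go into frame storage and a wrapper (no copy) when the frame is exposed through a row-oriented API.

```python
from typing import Any, Callable, Dict, Iterator, List, Sequence, Tuple


class ToyFrame:
    """Hypothetical stand-in for a columnar frame: one list per column."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self.columns = columns

    @classmethod
    def from_rows(cls, names: Sequence[str], rows: Sequence[Tuple[Any, ...]]) -> "ToyFrame":
        """Rows -> frame: every value is copied into new column storage."""
        return cls({name: [row[i] for row in rows] for i, name in enumerate(names)})

    def num_rows(self) -> int:
        return len(next(iter(self.columns.values()), []))


class RowView:
    """Row-oriented wrapper over a ToyFrame; keeps a reference, not a copy."""

    def __init__(self, frame: ToyFrame) -> None:
        self._frame = frame

    def __iter__(self) -> Iterator[Tuple[Any, ...]]:
        names = list(self._frame.columns)
        for i in range(self._frame.num_rows()):
            # Rows are assembled on demand from the frame's own column storage.
            yield tuple(self._frame.columns[name][i] for name in names)

    def map(self, fn: Callable[[Tuple[Any, ...]], Any]) -> List[Any]:
        """Minimal collection-style operation, evaluated eagerly for simplicity."""
        return [fn(row) for row in self]


# Made-up source rows, standing in for a row-oriented collection.
rows = [("Prague", 21.5), ("Austin", 33.0)]

frame = ToyFrame.from_rows(["city", "temp"], rows)  # duplication happens here
view = RowView(frame)                               # no duplication here

# Mutating the frame's storage is done only to show which objects share data.
frame.columns["temp"][0] = 20.0
print(list(view))  # [('Prague', 20.0), ('Austin', 33.0)]
print(rows)        # [('Prague', 21.5), ('Austin', 33.0)]
```

The view reflects the change because it reads straight from the frame, while the original rows are unaffected because the frame holds its own copy of the data.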