Modeling in Sparkling Water
We saw in Chapter 2, Platform Components and Key Concepts, that Sparkling Water is simply H2O-3 in an Apache Spark environment. From the Python coder's point of view, H2O-3 code is virtually identical to Sparkling Water code. If the code is the same, why have a separate section for modeling in Sparkling Water? There are two important reasons:
- Sparkling Water enables data scientists to leverage Spark's extensive data processing capabilities.
- Sparkling Water provides access to production Spark pipelines.
We expand upon these reasons next.
Spark is rightly known for data operations that scale readily as data volume grows. Because Spark is now almost a given in enterprise settings, data scientists should add it to their skill set. This is not nearly as hard as it seems, since Spark can be operated from Python (using PySpark) with data operations written primarily in Spark...