What this book covers
Chapter 1, Opportunities and Challenges, sets the context of the book. We will first define ML at scale around three areas: building high-quality models on large to massive datasets, deploying them for scoring in diverse enterprise environments, and navigating multiple stakeholder concerns along the way. We will then recognize the vast business opportunities and execution challenges of ML in this context. In this light, you will be introduced to how H2O overcomes these challenges with its H2O-3, Sparkling Water, Enterprise Steam, and MOJO technologies that form the H2O at Scale framework.
Chapter 2, Platform Components and Key Concepts, gives an overview of each H2O component by describing where it fits in the ML life cycle, what its key features are, and how it overcomes the challenges of ML at scale. We then distill several key concepts from this overview. The goal of this chapter is to provide you with a foundational knowledge of how H2O at Scale works before you learn how to implement it.
Chapter 3, Fundamental Workflow – Data to Deployable Model, shows the minimal steps needed to build and deploy models with the H2O at Scale framework. Think of this as a Hello World example, with each step explained. There are alternative ways to implement these steps, and they will be explored. At this point in the book, we will end our general overview and move on to advanced topics.
Chapter 4, H2O Model Building at Scale – Capability Articulation, starts our model-building focus and is of interest primarily to data scientists. In this chapter, we familiarize ourselves with H2O's extensive range of modeling capabilities, from data ingestion and manipulation to algorithms, model training, evaluation, and explainability techniques. Think of this chapter as the what of H2O model building, and the next chapters as an advanced treatment of the how and why.
Chapter 5, Advanced Model Building – Part I, introduces you to the advanced model-building topics that a data scientist considers when building enterprise-grade models. We discuss data-splitting options, compare modeling algorithms, present a two-stage grid-search strategy for hyperparameter optimization, introduce H2O AutoML for automatically fitting multiple algorithms to data, and investigate feature engineering options for improving model performance. By the end of this chapter, you should be able to build an enterprise-scale, optimized, and predictive model using one or more supervised learning algorithms available within H2O.
Chapter 6, Advanced Model Building – Part II, continues our advanced model-building topics by showing how to build H2O supervised learning models within an Apache Spark pipeline, reviewing H2O's unsupervised learning methods, discussing best practices for updating H2O models, and introducing requirements to ensure H2O model reproducibility.
Chapter 7, Understanding ML Models, outlines a set of capabilities within H2O for explaining ML models. Building a model that predicts well is not enough. A critical step before putting any model into production is understanding how it makes decisions. We discuss selecting appropriate model metrics, using multiple diagnostics to build trust in a model, and using global and local explanations with model performance metrics to choose the best among a set of candidate models. This includes an evaluation of tradeoffs between model performance, speed of scoring, and assumptions met in a candidate model.
Chapter 8, Putting It All Together, starts the way most data science projects do: with raw data and a general business objective. We refine both the data and the problem statement until the problem is relevant to the business and can be answered by the available data. We engineer a variety of features, creating and evaluating multiple candidate models until we arrive at a final model. We evaluate the final model and illustrate the preparation steps required for model deployment. The treatment in this chapter is intended to reflect the day-to-day work of a data scientist in the enterprise.
Chapter 9, Production Scoring and the H2O MOJO, starts our focus on model deployment. ML engineers, enterprise architects, software developers, and general technologists will be particularly interested in this chapter. You will become familiar with the strengths of H2O's MOJO as a scoring artifact, and how easily it can be deployed to a wide variety of enterprise systems. You will finish by writing a batch file scoring program that embeds a MOJO to demonstrate this flexibility.
Chapter 10, H2O Model Deployment Patterns, explores the many ways a MOJO can be deployed. You will first survey a diverse sample of possible deployment patterns, and then drill down into the implementation details of each. The patterns cover real-time, streaming, and batch scoring on a variety of specialized H2O scoring software, third-party integrations, and your own custom-built systems.
Chapter 11, The Administrator and Operations Views, starts our focus on enterprise stakeholder perspectives of ML at scale with H2O. Although the chapter focuses on enterprise stakeholder activities and concerns, data scientists are shown how these relate to their own work. In this chapter, system administrators and operators will learn in detail how Enterprise Steam is configured, and how users are secured and managed so data scientists can self-provision environments in a governed way. We will also identify operations activities around maintaining and troubleshooting H2O workloads and components.
Chapter 12, The Enterprise Architect and Security Views, covers the enterprise architect and security perspectives of H2O at Scale components. You will understand in detail the implementation alternatives of H2O and how the components integrate, communicate, and deploy. You will see that the H2O at Scale framework can be deployed on its own or as a member of the much larger H2O AI Cloud, which we cover in the next chapter.
Chapter 13, Introducing H2O AI Cloud, gives an overview of H2O.ai's full end-to-end ML life cycle platform. The H2O at Scale framework and everything covered in the book to this point is a subset of the H2O AI Cloud. In this chapter, we will review the H2O AI Cloud components and their key features, including four specialized model-building engines, full-featured MLOps and Feature Store components, and an open source low-code SDK to build and integrate AI Apps and host them on an Appstore.
Chapter 14, H2O at Scale in a Larger Platform Context, finishes the book by taking everything we have learned and showing how the H2O at Scale framework acquires categorically new and exciting possibilities when used as a part of the H2O AI Cloud. We provide examples of these possibilities and then present a reference enterprise integration framework using H2O for you to imagine your own possibilities.
Appendix, Alternative Methods to Launch H2O Clusters, shows different ways you can create H2O environments to run the code samples in this book.