Chapter 1, Practical Machine Learning with Spark Using Scala, covers installing and configuring a real-life development environment for machine learning and programming with Apache Spark. Using screenshots, it walks you through downloading, installing, and configuring Apache Spark and IntelliJ IDEA, along with the necessary libraries, to reflect a developer’s desktop in a real-world setting. It then identifies and lists over 40 data repositories with real-world datasets that can help the reader experiment with and advance the code recipes even further. In the final step, we run our first ML program on Spark and then provide directions on how to add graphics to your machine learning programs, which are used in the subsequent chapters.
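For a flavor of where the chapter lands, here is a minimal sketch of a first Spark 2.0 session in Scala. The column names and sample rows are illustrative, not the chapter's recipe code; it assumes a spark-shell session where `spark` is predefined:

```scala
// In spark-shell, `spark` (a SparkSession) already exists.
import spark.implicits._

// Build a tiny DataFrame and display it -- a sanity check that the
// environment (Spark, Scala, and the IDE or shell) is wired up correctly.
val libraries = Seq(("MLlib", "RDD-based"), ("Spark ML", "DataFrame-based"))
  .toDF("library", "api_style")

libraries.show()
```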
Chapter 2, Just Enough Linear Algebra for Machine Learning with Spark, covers the use of linear algebra (vectors and matrices), which is the foundation of some of the most monumental works in machine learning. Through its recipes, it provides a comprehensive treatment of the DenseVector, SparseVector, and matrix facilities available in Apache Spark. It provides recipes for both local and distributed matrices, including RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix, to give a detailed explanation of this topic. We included this chapter because mastery of Spark’s ML/MLlib was only possible by reading most of the source code line by line and understanding how matrix decomposition and vector/matrix arithmetic work underneath the coarser-grained algorithms in Spark.
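As a minimal sketch of the facilities the chapter works through (the values here are illustrative, and `sc` is assumed to be an existing SparkContext, as in spark-shell):

```scala
import org.apache.spark.ml.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}

// Local vectors: dense stores every element, sparse stores (index, value) pairs
val dense  = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)) // same vector, sparse form

// A local 2 x 3 matrix, supplied in column-major order
val local = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

// A distributed RowMatrix built from an RDD of (RDD-based API) vectors
val rows = sc.parallelize(Seq(
  MLlibVectors.dense(1.0, 2.0),
  MLlibVectors.dense(3.0, 4.0)
))
val distributed = new RowMatrix(rows)
println(s"${distributed.numRows()} x ${distributed.numCols()}")
```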
Chapter 3, Spark’s Three Data Musketeers for Machine Learning - Perfect Together, provides an end-to-end treatment of the three pillars of resilient distributed data manipulation and wrangling in Apache Spark. The chapter comprises detailed recipes covering the RDD, DataFrame, and Dataset facilities from a practitioner’s point of view. Through an exhaustive list of 17 recipes, examples, references, and explanations, it lays the foundation for building a successful career in the machine learning sciences. The chapter provides both functional (code) and declarative (SQL interface) programming approaches to solidify the knowledge base, reflecting the real demands of a successful Spark ML engineer at tier-1 companies.
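A minimal sketch contrasting the three APIs follows. The Car case class and its data are our own illustrative choices; the snippet assumes a spark-shell session where `spark` is predefined:

```scala
import spark.implicits._

case class Car(make: String, price: Double)   // illustrative schema

// RDD: low-level, functional transformations over distributed data
val rdd = spark.sparkContext.parallelize(Seq(Car("Ford", 22000.0), Car("BMW", 45000.0)))

// DataFrame: untyped rows, plus a SQL interface
val df = rdd.toDF()
df.createOrReplaceTempView("cars")
spark.sql("SELECT make, price FROM cars WHERE price > 30000").show()

// Dataset: the same data behind a typed, compile-time-checked API
val ds = df.as[Car]
ds.filter(_.price > 30000).show()
```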
Chapter 4, Common Recipes for Implementing a Robust Machine Learning System, covers and factors out the tasks that are common to most machine learning systems, through 16 short but to-the-point code recipes that the reader can use in their own real-world systems. It covers a gamut of techniques, ranging from normalizing data to evaluating model output with best-practice metrics, via Spark’s ML/MLlib facilities that might not be readily visible to the reader. It is a combination of recipes that we use in our day-to-day jobs in most situations, listed separately here to reduce the length and complexity of the other recipes.
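As a hedged illustration of two such common tasks, here is a sketch of feature scaling and metric-based evaluation. The data, column names, and the choice of MinMaxScaler and RMSE are our own examples, not the chapter's specific recipes:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Normalize each feature into [0, 1], a common pre-processing step
val features = Seq(
  Vectors.dense(10.0, 1.5),
  Vectors.dense(20.0, 0.5),
  Vectors.dense(30.0, 3.0)
).map(Tuple1.apply).toDF("features")

val scaler = new MinMaxScaler().setInputCol("features").setOutputCol("scaledFeatures")
scaler.fit(features).transform(features).show(false)

// Evaluate model output against labels with a standard metric (RMSE here)
val predictions = Seq((2.0, 2.2), (4.0, 3.7), (6.0, 6.4)).toDF("label", "prediction")
val rmse = new RegressionEvaluator().setMetricName("rmse").evaluate(predictions)
println(s"RMSE = $rmse")
```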
Chapter 5, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I, is the first of two chapters exploring classification and regression in Apache Spark. This chapter starts with Generalized Linear Regression (GLM), extending it to Lasso and Ridge with the different types of optimization available in Spark. The chapter then proceeds to cover Isotonic regression and Survival regression, along with the multilayer perceptron (a neural network) and the One-vs-Rest classifier.
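A minimal sketch of the regularized linear regression family follows; the training rows and parameter values are illustrative rather than the chapter's recipe data:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import spark.implicits._

val training = Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (2.0, Vectors.dense(1.0, 1.9)),
  (3.0, Vectors.dense(2.0, 3.2))
).toDF("label", "features")

// elasticNetParam selects the penalty: 0.0 = Ridge (L2), 1.0 = Lasso (L1),
// and values in between mix the two (elastic net)
val lasso = new LinearRegression()
  .setMaxIter(100)
  .setRegParam(0.1)
  .setElasticNetParam(1.0)

val model = lasso.fit(training)
println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")
```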
Chapter 6, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II, is the second of the two regression and classification chapters. This chapter covers RDD-based regression systems, ranging from Linear, Logistic, and Ridge to Lasso, using Stochastic Gradient Descent and L-BFGS optimization in Spark. The last three recipes cover Support Vector Machines (SVM) and Naïve Bayes, ending with a detailed recipe for ML pipelines, which are gaining a prominent position in the Spark ML ecosystem.
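As one hedged example of the RDD-based, SGD-driven family, here is a sketch of a linear SVM; the tiny, linearly separable dataset is our own, and `sc` is assumed to be an existing SparkContext:

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A toy, linearly separable training set (labels must be 0.0 or 1.0)
val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 3.0)),
  LabeledPoint(1.0, Vectors.dense(3.0, 4.0)),
  LabeledPoint(0.0, Vectors.dense(-2.0, -3.0)),
  LabeledPoint(0.0, Vectors.dense(-3.0, -2.0))
))

// Train a linear SVM for 100 iterations of stochastic gradient descent
val model = SVMWithSGD.train(data, 100)
println(model.predict(Vectors.dense(2.5, 3.5)))   // expected: 1.0
```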
Chapter 7, Recommendation Engine that Scales with Spark, covers how to explore your dataset and build a movie recommendation engine using Spark’s ML library facilities. It uses a large dataset across several recipes, along with figures and write-ups, to explore the various methods of recommendation before going deep into the collaborative filtering techniques in Spark.
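A minimal collaborative filtering sketch with Spark's ALS follows; the toy ratings table stands in for the large movie dataset the chapter actually uses, and the column names and parameters are illustrative:

```scala
import org.apache.spark.ml.recommendation.ALS
import spark.implicits._

// A toy (user, movie, rating) table; the recipes use a real movie dataset
val ratings = Seq(
  (0, 10, 4.0f), (0, 11, 1.0f),
  (1, 10, 5.0f), (1, 12, 2.0f),
  (2, 11, 4.0f), (2, 12, 5.0f)
).toDF("userId", "movieId", "rating")

// Alternating Least Squares factorizes the ratings matrix into
// low-rank user and item factors
val als = new ALS()
  .setMaxIter(10)
  .setRegParam(0.1)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")

val model = als.fit(ratings)
model.transform(ratings).show()   // predicted ratings alongside actuals
```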
Chapter 8, Unsupervised Clustering with Apache Spark 2.0, covers the techniques used in unsupervised learning, such as KMeans, Gaussian Mixture with Expectation Maximization (EM), Power Iteration Clustering (PIC), and Latent Dirichlet Allocation (LDA), while also covering the why and how, to help the reader understand the core concepts. Using Spark Streaming, the chapter commences with a real-time KMeans clustering recipe that classifies the input stream into labeled classes via unsupervised means.
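A minimal KMeans sketch follows, with two deliberately obvious clusters; the points, k, and seed are our own illustrative values:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Two obvious clusters: points near (0, 0) and points near (9, 9)
val points = Seq(
  Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
  Vectors.dense(9.0, 8.8), Vectors.dense(8.9, 9.1)
).map(Tuple1.apply).toDF("features")

val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(points)

model.clusterCenters.foreach(println)   // the two learned centroids
model.transform(points).show()          // each point with its assigned cluster
```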
Chapter 9, Optimization - Going Down the Hill with Gradient Descent, is a unique chapter that walks you through optimization as it applies to machine learning. It starts from closed-form formulas and quadratic function optimization (for example, of a cost function), and moves on to using Gradient Descent (GD) to solve a regression problem from scratch. The chapter helps the reader look under the hood, developing their skill set in Scala while providing an in-depth explanation of how to code and understand Stochastic Gradient Descent (SGD) from scratch. The chapter concludes with one of Spark’s ML APIs, used to achieve the same results that we coded from scratch.
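In the same from-scratch spirit, here is a minimal batch gradient descent sketch in plain Scala, fitting y = w*x + b by minimizing mean squared error; the data and hyperparameters are our own illustrative choices, not the chapter's recipe:

```scala
// Fit y = w*x + b by batch gradient descent on mean squared error
val xs = Array(1.0, 2.0, 3.0, 4.0)
val ys = Array(2.1, 3.9, 6.2, 8.0)          // roughly y = 2x

var w = 0.0
var b = 0.0
val learningRate = 0.01
val n = xs.length

for (_ <- 1 to 2000) {
  var gradW = 0.0
  var gradB = 0.0
  for (i <- xs.indices) {
    val error = (w * xs(i) + b) - ys(i)     // prediction minus target
    gradW += error * xs(i)
    gradB += error
  }
  // Step opposite the gradient of the cost function
  w -= learningRate * gradW / n
  b -= learningRate * gradB / n
}
println(f"w = $w%.3f, b = $b%.3f")          // w should approach ~2.0
```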
Chapter 10, Building Machine Learning Systems with Decision Tree and Ensemble Models, covers tree and ensemble models for classification and regression in depth using Spark’s machine learning library. We use three real-world datasets to explore the classification and regression problems using Decision Tree, Random Forest, and Gradient-Boosted Trees. The chapter provides an in-depth explanation of these methods in addition to plug-and-play code recipes that explore Apache Spark’s machine learning library step by step.
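A minimal Random Forest classification sketch follows; the tiny labeled dataset and parameter values are our own illustration, with a StringIndexer supplying the label column as in Spark's canonical pipeline pattern:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// A toy binary classification set; real recipes use full datasets
val data = Seq(
  ("bad",  Vectors.dense(1.0, 0.5)),
  ("bad",  Vectors.dense(1.2, 0.4)),
  ("good", Vectors.dense(5.0, 4.5)),
  ("good", Vectors.dense(5.3, 4.8))
).toDF("quality", "features")

// Index the string label into a numeric "label" column, then fit the forest
val indexer = new StringIndexer().setInputCol("quality").setOutputCol("label")
val rf = new RandomForestClassifier()
  .setNumTrees(10)      // an ensemble of 10 decision trees
  .setMaxDepth(4)

val model = new Pipeline().setStages(Array(indexer, rf)).fit(data)
model.transform(data).select("quality", "prediction").show()
```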
Chapter 11, The Curse of High-Dimensionality in Big Data, demystifies the art and science of dimensionality reduction and provides complete coverage of Spark’s ML/MLlib facilities for this important concept in machine learning at scale. The chapter provides sufficient and in-depth coverage of the theory (the what and why) and then proceeds to cover the two fundamental techniques available in Spark (the how) for the reader to use. The chapter covers Singular Value Decomposition (SVD), which relates well to Chapter 2, and then proceeds to examine Principal Component Analysis (PCA) in depth with code and write-ups.
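A minimal PCA sketch follows; the feature vectors and the choice of k = 2 are illustrative, and the closing comment points at the RDD-based entry point for SVD:

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

val df = Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0),
  Vectors.dense(6.0, 1.0, 3.0, 8.0)
).map(Tuple1.apply).toDF("features")

// Project 4-dimensional features onto their top 2 principal components
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(2)

val model = pca.fit(df)
model.transform(df).select("pcaFeatures").show(false)

// For SVD, the RDD-based API offers RowMatrix.computeSVD(k, computeU = true)
```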
Chapter 12, Implementing Text Analytics with Spark 2.0 ML Library, covers the various techniques available in Spark for implementing text analytics at scale. It provides a comprehensive treatment, starting from the basics, such as Term Frequency (TF), and similarity techniques, such as Word2Vec, and moves on to analyzing a complete dump of Wikipedia for a real-life Spark ML project. The chapter concludes with an in-depth discussion of, and code for, implementing Latent Semantic Analysis (LSA) and Topic Modeling with Latent Dirichlet Allocation (LDA) in Spark.
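A minimal TF-IDF sketch follows, standing in for the Wikipedia-scale pipeline the chapter builds; the two toy documents and the numFeatures value are our own illustration:

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import spark.implicits._

val docs = Seq(
  (0, "spark makes machine learning scalable"),
  (1, "text analytics with spark ml")
).toDF("id", "text")

// Split text into words, hash words into term-frequency vectors,
// then down-weight terms that appear in many documents (IDF)
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val words = tokenizer.transform(docs)

val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1000)
val tf = hashingTF.transform(words)

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val tfidf = idf.fit(tf).transform(tf)
tfidf.select("id", "features").show(false)
```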
Chapter 13, Spark Streaming and Machine Learning Library, starts by providing an introduction to, and the future direction of, Spark Streaming, and then proceeds to provide recipes for both RDD-based (DStream) and structured streaming to establish a baseline. The chapter then covers all the ML streaming algorithms available in Spark at the time of writing this book. It provides code and shows how to implement streaming DataFrames and streaming Datasets, and then covers queueStream for debugging before going into Streaming KMeans (unsupervised learning) and streaming linear models, such as linear and logistic regression, using real-world datasets.
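A minimal sketch combining queueStream-based debugging with Streaming KMeans follows; the batch interval, k, decay factor, and the single hand-built micro-batch are our own illustrative choices, and `sc` is assumed to be an existing SparkContext:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val ssc = new StreamingContext(sc, Seconds(1))

// queueStream lets us feed pre-built RDDs into the stream, which is
// handy for debugging streaming logic without a live source
val queue = mutable.Queue[RDD[Vector]]()
val trainingStream = ssc.queueStream(queue)

val model = new StreamingKMeans()
  .setK(2)
  .setDecayFactor(1.0)              // 1.0 = weight all past data equally
  .setRandomCenters(2, weight = 0.0)

model.trainOn(trainingStream)       // centroids update with each micro-batch

queue += sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
ssc.start()
ssc.awaitTerminationOrTimeout(5000)
model.latestModel().clusterCenters.foreach(println)
ssc.stop(stopSparkContext = false)
```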