You're reading from Apache Spark 2.x Machine Learning Cookbook

Product type: Book

Published: Sep 2017

Publisher: Packt

ISBN-13: 9781783551606

Pages: 666

Edition: 1st

Languages: Scala

Concepts: Machine Learning

Authors (5): Mohammed Guller, Siamak Amirghodsi, Shuen Mei, Meenakshi Rajendran, Broderick Hall

Table of Contents (20) Chapters

Title Page

Credits

About the Authors

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

1. Practical Machine Learning with Spark Using Scala

2. Just Enough Linear Algebra for Machine Learning with Spark

3. Spark's Three Data Musketeers for Machine Learning - Perfect Together

4. Common Recipes for Implementing a Robust Machine Learning System

5. Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I

6. Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II

7. Recommendation Engine that Scales with Spark

8. Unsupervised Clustering with Apache Spark 2.0

9. Optimization - Going Down the Hill with Gradient Descent

10. Building Machine Learning Systems with Decision Tree and Ensemble Models

11. Curse of High-Dimensionality in Big Data

12. Implementing Text Analytics with Spark 2.0 ML Library

13. Spark Streaming and Machine Learning Library

Singular Value Decomposition (SVD) to reduce high-dimensionality in Spark

In this recipe, we will explore a dimensionality reduction method straight out of linear algebra: SVD (Singular Value Decomposition). Rather than working with a large M by N matrix directly, the goal is to come up with a set of low-rank matrices (typically three) that approximate the original matrix while holding much less data.
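To make the size saving concrete, the rank-k factorization can be written out in standard notation. The symbols below (A, U_k, Σ_k, V_k, k) and the example dimensions are illustrative assumptions, not values taken from the recipe:

```latex
A_{M \times N} \;\approx\; U_k \, \Sigma_k \, V_k^{\top},
\qquad U_k \in \mathbb{R}^{M \times k},\;
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\;
V_k \in \mathbb{R}^{N \times k},\;
\sigma_1 \ge \dots \ge \sigma_k \ge 0

\text{stored values: } \; MN \;\longrightarrow\; k(M + N + 1)

M = 10^4,\; N = 10^3,\; k = 20: \quad 10^7 \;\longrightarrow\; 20 \cdot 11001 = 220{,}020 \;(\approx 2.2\%)
```

The smaller k is relative to M and N, the larger the saving, at the cost of a coarser approximation of the original matrix.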

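SVD is a linear algebra technique that factors the original data matrix into low-rank matrices of singular vectors and singular values (closely related to the eigenvectors and eigenvalues of the matrix multiplied by its own transpose). Together, these capture most of the information in the original attributes (the original dimensions) in a much more compact low-rank matrix system.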
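As a rough illustration of the factorization described above, the following minimal sketch uses the RDD-based `RowMatrix.computeSVD` call from Spark MLlib on a tiny made-up matrix. The object name `SvdSketch`, the helper `topK`, the sample data, and the choice of `k = 2` are all hypothetical and are not taken from the recipe's own code; it assumes the `spark-mllib` artifact is on the classpath.

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object SvdSketch {

  // Hypothetical helper: returns the top-k singular values, the matrix V of
  // right singular vectors, and the input rows projected onto those vectors.
  def topK(spark: SparkSession, data: Seq[Vector], k: Int): (Vector, Matrix, RowMatrix) = {
    val mat = new RowMatrix(spark.sparkContext.parallelize(data))
    // computeU = false: U is not needed for the projection below
    val svd = mat.computeSVD(k, computeU = false)
    // A * V gives each original row expressed in at most k derived features
    val projected = mat.multiply(svd.V)
    (svd.s, svd.V, projected)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("SvdSketch")
      .getOrCreate()

    // Made-up 5 x 4 data set whose columns are strongly correlated
    val data = Seq(
      Vectors.dense(1.0, 2.0, 3.0, 4.0),
      Vectors.dense(2.0, 4.1, 6.0, 8.2),
      Vectors.dense(0.5, 1.0, 1.4, 2.1),
      Vectors.dense(3.0, 5.9, 9.1, 12.0),
      Vectors.dense(1.5, 3.1, 4.4, 6.0)
    )

    val (s, v, projected) = topK(spark, data, k = 2)

    println(s"Singular values (largest first): $s")
    println(s"V is ${v.numRows} x ${v.numCols}")
    println(s"Reduced data is ${projected.numRows()} x ${projected.numCols()}")

    spark.stop()
  }
}
```

Multiplying the original matrix by V is equivalent to computing U times the diagonal matrix of singular values, so each row ends up described by k derived features instead of the original four. Comparing the magnitudes in `s` is one common way to judge how many of those features are worth keeping.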

The following figure depicts how SVD can be used to reduce dimensions: the S matrix (the singular values) is used to decide which higher-level concepts derived from the original data to keep or eliminate, yielding a low-rank matrix with fewer columns/features than the original: