You're reading from Apache Spark 2.x Machine Learning Cookbook: Over 100 recipes to simplify machine learning model implementations with Spark

Product type Paperback

Published in Sep 2017

Publisher Packt

ISBN-13 9781783551606

Length 666 pages

Edition 1st Edition

Languages

Scala

Tools

Apache Spark

Concepts

Machine Learning

Authors (5):

Broderick Hall

Meenakshi Rajendran

Shuen Mei

Mohammed Guller

Siamak Amirghodsi

Table of Contents (14) Chapters

Preface

1. Practical Machine Learning with Spark Using Scala

2. Just Enough Linear Algebra for Machine Learning with Spark

3. Spark's Three Data Musketeers for Machine Learning - Perfect Together

4. Common Recipes for Implementing a Robust Machine Learning System

5. Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I

6. Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II

7. Recommendation Engine that Scales with Spark

8. Unsupervised Clustering with Apache Spark 2.0

9. Optimization - Going Down the Hill with Gradient Descent

10. Building Machine Learning Systems with Decision Tree and Ensemble Models

11. Curse of High-Dimensionality in Big Data

12. Implementing Text Analytics with Spark 2.0 ML Library

13. Spark Streaming and Machine Learning Library

Principal Component Analysis (PCA) to pick the most effective latent factor for machine learning in Spark

In this recipe, we use PCA (Principal Component Analysis) to map higher-dimensional data (the apparent dimensions) to a lower-dimensional space (the actual dimensions). It is hard to believe, but PCA has its roots as early as 1901 (see K. Pearson's writings) and was developed again independently in the 1930s by H. Hotelling.

PCA attempts to pick new components in a manner that maximizes the variance along perpendicular axes and effectively transforms high-dimensional original features to a lower-dimensional space with derived components that can explain the variation (discriminate classes) in a more concise form.
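The idea of choosing perpendicular axes that maximize variance can be illustrated with a minimal sketch in plain Scala, without Spark. This is not the recipe's own code: the object name `Pca2DSketch`, the helper `principalAxis`, and the sample points are assumptions made up for illustration. For two-dimensional data, the first principal component is the eigenvector of the 2x2 sample covariance matrix with the largest eigenvalue, and that eigenvalue is the variance captured along that axis.

```scala
// Minimal 2-D PCA sketch in plain Scala (no Spark). All names and data here
// are hypothetical and exist only to illustrate the idea.
object Pca2DSketch {

  // Returns (largest eigenvalue, unit eigenvector for it, smallest eigenvalue)
  // of the sample covariance matrix of the given (x, y) points.
  def principalAxis(points: Seq[(Double, Double)]): (Double, (Double, Double), Double) = {
    require(points.size >= 2, "need at least two points to estimate a covariance")
    val n = points.size.toDouble
    val meanX = points.map(_._1).sum / n
    val meanY = points.map(_._2).sum / n

    // Entries of the symmetric covariance matrix [[sxx, sxy], [sxy, syy]].
    val sxx = points.map(p => math.pow(p._1 - meanX, 2)).sum / (n - 1)
    val syy = points.map(p => math.pow(p._2 - meanY, 2)).sum / (n - 1)
    val sxy = points.map(p => (p._1 - meanX) * (p._2 - meanY)).sum / (n - 1)

    // Closed-form eigenvalues of a symmetric 2x2 matrix.
    val half = (sxx + syy) / 2
    val disc = math.sqrt(math.pow((sxx - syy) / 2, 2) + sxy * sxy)
    val l1 = half + disc
    val l2 = half - disc

    // Eigenvector for l1; when sxy is zero the axes are already aligned.
    val (vx, vy) =
      if (sxy != 0.0) (sxy, l1 - sxx)
      else if (sxx >= syy) (1.0, 0.0)
      else (0.0, 1.0)
    val norm = math.hypot(vx, vy)
    (l1, (vx / norm, vy / norm), l2)
  }

  def main(args: Array[String]): Unit = {
    // Made-up points lying close to the line y = x.
    val data = Seq((1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1))
    val (l1, (vx, vy), l2) = principalAxis(data)

    // Share of the total variance explained by the first component.
    val explained = l1 / (l1 + l2)
    println(f"first component direction: ($vx%.4f, $vy%.4f)")
    println(f"explained variance ratio:  $explained%.4f")

    // Project each centered point onto the first component (2-D to 1-D).
    val meanX = data.map(_._1).sum / data.size
    val meanY = data.map(_._2).sum / data.size
    val scores = data.map { case (x, y) => (x - meanX) * vx + (y - meanY) * vy }
    println(scores.map(s => f"$s%.4f").mkString("scores: ", ", ", ""))
  }
}
```

When the points lie close to a line, the first component should account for almost all of the variance, so the one-dimensional scores retain most of the information in the original two columns. For real workloads on Spark, a distributed implementation such as `org.apache.spark.ml.feature.PCA` would typically be used rather than a hand-rolled closed form like this one.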

The intuition behind PCA is depicted in the following figure. Let's assume for now that our data has two dimensions (x, y) and the question we are going to ask the data is...

Authors (5)

Siamak Amirghodsi

Siamak Amirghodsi (Sammy) is interested in building advanced technical teams, executive management, Spark, Hadoop, big data analytics, AI, deep learning nets, TensorFlow, cognitive models, swarm algorithms, real-time streaming systems, quantum computing, financial risk management, trading signal discovery, econometrics, long-term financial cycles, IoT, blockchain, probabilistic graphical models, cryptography, and NLP.

Mohammed Guller

Author of Big Data Analytics with Spark - http://www.apress.com/9781484209653

Shuen Mei

Shuen Mei is a big data analytics platform expert with 15+ years of experience in designing, building, and executing large-scale, enterprise-distributed financial systems with mission-critical low-latency requirements. He is certified in Apache Spark and the Cloudera Big Data platform, including Developer, Admin, and HBase. He is also a certified AWS solutions architect with an emphasis on petabyte-range real-time data platform systems.

Rajendran

Cedric Rajendran is a senior staff engineer in technical support with VMware. He has around 13 years of experience covering a wide spectrum of technologies. He holds a master's degree specializing in International Business. He has served in the fields of Network Ops, Technical Support, and Consulting. His core strengths are in server and storage virtualization. He has authored a book on VMware Virtual SAN, holds advanced certifications with VMware, and is also a TOGAF-certified Enterprise Architect.

Broderick Hall

Broderick Hall is a hands-on big data analytics expert. He holds a master's degree in computer science and has 20 years of experience in designing and developing complex enterprise-wide software applications with real-time and regulatory requirements at a global scale. He is a deep learning early adopter and is currently working on a large-scale cloud-based data platform with deep learning net augmentation.