You're reading from Apache Spark 2.x Machine Learning Cookbook Over 100 recipes to simplify machine learning model implementations with Spark

Product type Paperback

Published in Sep 2017

Publisher Packt

ISBN-13 9781783551606

Length 666 pages

Edition 1st Edition

Languages

Scala

Tools

Apache Spark

Concepts

Machine Learning

Authors (5):

Broderick Hall

Meenakshi Rajendran

Shuen Mei

Mohammed Guller

Siamak Amirghodsi

Table of Contents (14) Chapters

Preface

1. Practical Machine Learning with Spark Using Scala

2. Just Enough Linear Algebra for Machine Learning with Spark

3. Spark's Three Data Musketeers for Machine Learning - Perfect Together

4. Common Recipes for Implementing a Robust Machine Learning System

5. Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I

6. Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II

7. Recommendation Engine that Scales with Spark

8. Unsupervised Clustering with Apache Spark 2.0

9. Optimization - Going Down the Hill with Gradient Descent

10. Building Machine Learning Systems with Decision Tree and Ensemble Models

11. Curse of High-Dimensionality in Big Data

12. Implementing Text Analytics with Spark 2.0 ML Library

13. Spark Streaming and Machine Learning Library

Topic modeling with Latent Dirichlet allocation in Spark 2.0

In this recipe, we demonstrate topic model generation, using Latent Dirichlet Allocation (LDA) to infer topics from a collection of documents.

We covered LDA in previous chapters as it applies to clustering and topic modeling. In this chapter, we work through a more elaborate example that applies it to text analytics on more complex, real-life datasets.

We also apply NLP techniques such as stemming and stop-word removal to take a more realistic approach to LDA problem-solving. The goal is to discover a set of latent factors (that is, factors different from the original features) that describe the data more efficiently in a reduced computational space.
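The flow described above (tokenize, drop stop words, vectorize, fit LDA) can be sketched with the Spark ML pipeline API. This is a minimal illustration and not the book's recipe: the object name `LdaTopicSketch`, the helper `topTerms`, the toy corpus `sampleDocs`, and all parameter values are assumptions chosen for the example. Stemming is left out because it typically relies on an external library rather than a built-in Spark ML stage.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.{LDA, LDAModel}
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover}
import org.apache.spark.sql.SparkSession

object LdaTopicSketch {

  // Hypothetical toy corpus, used only to make the sketch runnable.
  val sampleDocs: Seq[String] = Seq(
    "Spark makes distributed machine learning on big data practical",
    "The cluster runs Spark jobs over large distributed datasets",
    "The team scored a late goal to win the football match",
    "Fans cheered as the football team won the championship match"
  )

  // Returns the top terms of each inferred topic as readable strings.
  def topTerms(spark: SparkSession, texts: Seq[String], k: Int, termsPerTopic: Int): Array[Seq[String]] = {
    import spark.implicits._
    val docs = texts.zipWithIndex.map { case (t, i) => (i.toLong, t) }.toDF("id", "text")

    // Split on non-word characters, lowercase, and drop very short tokens.
    val tokenizer = new RegexTokenizer()
      .setInputCol("text").setOutputCol("tokens")
      .setPattern("\\W+").setMinTokenLength(3)

    // Remove stop words using Spark's default English list.
    val remover = new StopWordsRemover()
      .setInputCol("tokens").setOutputCol("filtered")

    // Turn each document into a sparse term-count vector.
    val vectorizer = new CountVectorizer()
      .setInputCol("filtered").setOutputCol("features")
      .setVocabSize(1000)

    // k, maxIter, and seed are illustrative values, not tuned settings.
    val lda = new LDA()
      .setK(k).setMaxIter(20).setSeed(42L)
      .setFeaturesCol("features")

    val model = new Pipeline()
      .setStages(Array(tokenizer, remover, vectorizer, lda))
      .fit(docs)

    val vocab = model.stages(2).asInstanceOf[CountVectorizerModel].vocabulary
    val ldaModel = model.stages(3).asInstanceOf[LDAModel]

    // describeTopics yields term indices; map them back to vocabulary words.
    ldaModel.describeTopics(termsPerTopic).collect().map { row =>
      row.getAs[Seq[Int]]("termIndices").map(i => vocab(i))
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("LdaTopicSketch").getOrCreate()
    topTerms(spark, sampleDocs, k = 2, termsPerTopic = 3).zipWithIndex.foreach {
      case (terms, topic) => println(s"Topic $topic: ${terms.mkString(", ")}")
    }
    spark.stop()
  }
}
```

On a corpus this small the inferred topics are unlikely to be meaningful; the point is only to show where stop-word removal sits relative to vectorization and model fitting.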

The first question that always comes up when using LDA and topic modeling is: what is Dirichlet? Dirichlet is...

Authors (5)

Amirghodsi

Siamak Amirghodsi (Sammy) is interested in building advanced technical teams, executive management, Spark, Hadoop, big data analytics, AI, deep learning nets, TensorFlow, cognitive models, swarm algorithms, real-time streaming systems, quantum computing, financial risk management, trading signal discovery, econometrics, long-term financial cycles, IoT, blockchain, probabilistic graphical models, cryptography, and NLP.

Mohammed Guller

Author of Big Data Analytics with Spark - http://www.apress.com/9781484209653

Shuen Mei

Shuen Mei is a big data analytics platform expert with 15+ years of experience in designing, building, and executing large-scale, enterprise-distributed financial systems with mission-critical low-latency requirements. He is certified in Apache Spark and the Cloudera Big Data platform, including Developer, Admin, and HBase. He is also a certified AWS solutions architect with an emphasis on petabyte-range real-time data platform systems.

Rajendran

Cedric Rajendran is a senior staff engineer in technical support with VMware. He has around 13 years of experience covering a wide spectrum of technologies. He holds a master's degree specializing in International Business. He has served in the fields of Network Ops, Technical Support, and Consulting. His core strengths are in server and storage virtualization. He has authored a book on VMware Virtual SAN, holds advanced certifications with VMware, and is also a TOGAF-certified Enterprise Architect.

Hall

Broderick Hall is a hands-on big data analytics expert. He holds a master's degree in computer science and has 20 years of experience in designing and developing complex enterprise-wide software applications with real-time and regulatory requirements at a global scale. He is a deep learning early adopter and is currently working on a large-scale cloud-based data platform with deep learning net augmentation.
