Machine Learning with Spark: Develop intelligent, distributed machine learning systems

Product type Paperback

Published in Apr 2017

Publisher Packt

ISBN-13 9781785889936

Length 532 pages

Edition 2nd Edition

Languages Scala

Tools Apache Spark

Concepts Machine Learning

Authors (2):

Manpreet Singh Ghotra

Rajdeep Dua

Table of Contents (13) Chapters

Preface

1. Getting Up and Running with Spark

2. Math for Machine Learning

3. Designing a Machine Learning System

4. Obtaining, Processing, and Preparing Data with Spark

5. Building a Recommendation Engine with Spark

6. Building a Classification Model with Spark

7. Building a Regression Model with Spark

8. Building a Clustering Model with Spark

9. Dimensionality Reduction with Spark

10. Advanced Text Processing with Spark

11. Real-Time Machine Learning with Spark Streaming

12. Pipeline APIs for Spark ML

Text classification with Spark 2.0

In this section, we use the libsvm-formatted version of the 20newsgroup data with the Spark DataFrame-based APIs to classify text documents. The current version of Spark supports libsvm version 3.22 (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).

Download the libsvm-formatted data from the following link and copy the output folder under Spark-2.0.x.

Visit the following link for the 20newsgroup libsvm data: https://1drv.ms/f/s!Av6fk5nQi2j-iF84quUlDnJc6G6D
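For orientation, each line of a libsvm-formatted file holds a label followed by sparse `index:value` pairs, with indices starting at 1. The following is a minimal plain-Scala sketch of how one such line could be read. It is illustrative only: Spark's built-in libsvm data source does this parsing for you, and the names `LibSVMLineSketch`, `Record`, and `parseLine`, as well as the sample line, are made up for this example.

```scala
// Illustrative sketch only; Spark's "libsvm" data source handles this internally.
object LibSVMLineSketch {

  // One parsed record: the class label plus sparse (zeroBasedIndex, value) pairs.
  final case class Record(label: Double, features: Seq[(Int, Double)])

  // Parses a single line of the form "label index:value index:value ...".
  def parseLine(line: String): Record = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.toSeq.map { token =>
      val Array(index, value) = token.split(":")
      // libsvm indices are 1-based; shift to 0-based, as Spark vectors are.
      (index.toInt - 1, value.toDouble)
    }
    Record(label, features)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample line, not taken from the 20newsgroup files.
    val sample = "3 1:2.0 7:1.0 42:5.0"
    println(parseLine(sample))
  }
}
```

The sketch does no validation, so an empty or malformed line would raise an exception.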

Import the appropriate packages from org.apache.spark.ml and create the wrapper Scala object:

package org.apache.spark.examples.ml 

import org.apache.spark.SparkConf
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

object DocumentClassificationLibSVM { 
  def main(args: Array[String]): Unit = { 

  ...
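The listing above is cut off in this preview. As a separate illustration, and not the authors' code, here is a minimal sketch of how libsvm data might be classified with the DataFrame-based API using the same imported classes. It assumes Spark 2.x on the classpath, labels that are 0-based class indices, and non-negative feature values. The object name, the `accuracy` helper, the 80/20 split, the seed, and the default input path are all assumptions made for this example.

```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

// Hypothetical sketch; not the book's DocumentClassificationLibSVM object.
object LibSVMNaiveBayesSketch {

  // Loads libsvm data, trains Naive Bayes on roughly 80% of the rows,
  // and returns the accuracy measured on the remaining rows.
  def accuracy(spark: SparkSession, path: String): Double = {
    // The libsvm data source yields two columns: "label" and "features".
    val data = spark.read.format("libsvm").load(path)
    val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    val model = new NaiveBayes().fit(training)
    val predictions = model.transform(test)

    new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
      .evaluate(predictions)
  }

  def main(args: Array[String]): Unit = {
    // Placeholder path (an assumption); pass the real location as the first argument.
    val path = args.headOption.getOrElse("path/to/20newsgroup-libsvm")

    val spark = SparkSession.builder()
      .appName("LibSVMNaiveBayesSketch")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()

    println(s"Test accuracy: ${accuracy(spark, path)}")
    spark.stop()
  }
}
```

Keeping the training and evaluation steps in a function that takes the `SparkSession` as a parameter makes the sketch easy to call from a shell session or a test with a different input file.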

Authors (2)

Manpreet Singh Ghotra has more than 15 years of experience in software development for both enterprise and big data software. He is currently working at Salesforce on developing a machine learning platform/APIs using open source libraries and frameworks such as Keras, Apache Spark, and TensorFlow. He has worked on various machine learning systems, including sentiment analysis, spam detection, and anomaly detection. He was part of the machine learning group at one of the largest online retailers in the world, working on transit time calculations using Apache Mahout, and the R recommendation system, again using Apache Mahout. With a master's and postgraduate degree in machine learning, he has contributed to, and worked for, the machine learning community.

Rajdeep Dua has over 18 years of experience in the cloud and big data space. He has taught Spark and big data at some of the most prestigious tech schools in India: IIIT Hyderabad, ISB, IIIT Delhi, and Pune College of Engineering. He currently leads the developer relations team at Salesforce India. He has also presented BigQuery and Google App Engine at the W3C conference in Hyderabad. He led the developer relations teams at Google, VMware, and Microsoft, and has spoken at hundreds of other conferences on the cloud. Other references to his work can be found at Your Story and in the ACM Digital Library. His contributions to the open source community relate to Docker, Kubernetes, Android, OpenStack, and Cloud Foundry.