Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Machine Learning with Spark

You're reading from   Machine Learning with Spark Develop intelligent, distributed machine learning systems

Arrow left icon
Product type Paperback
Published in Apr 2017
Publisher Packt
ISBN-13 9781785889936
Length 532 pages
Edition 2nd Edition
Languages
Arrow right icon
Authors (2):
Arrow left icon
Manpreet Singh Ghotra Manpreet Singh Ghotra
Author Profile Icon Manpreet Singh Ghotra
Manpreet Singh Ghotra
Rajdeep Dua Rajdeep Dua
Author Profile Icon Rajdeep Dua
Rajdeep Dua
Arrow right icon
View More author details
Toc

Table of Contents (13) Chapters Close

Preface 1. Getting Up and Running with Spark FREE CHAPTER 2. Math for Machine Learning 3. Designing a Machine Learning System 4. Obtaining, Processing, and Preparing Data with Spark 5. Building a Recommendation Engine with Spark 6. Building a Classification Model with Spark 7. Building a Regression Model with Spark 8. Building a Clustering Model with Spark 9. Dimensionality Reduction with Spark 10. Advanced Text Processing with Spark 11. Real-Time Machine Learning with Spark Streaming 12. Pipeline APIs for Spark ML

The first step to a Spark program in R

SparkR is an R package which provides a frontend to use Apache Spark from R. In Spark 1.6.0; SparkR provides a distributed data frame on large datasets. SparkR also supports distributed machine learning using MLlib. This is something you should try out while reading machine learning chapters.

SparkR DataFrames

DataFrame is a collection of data organized into names columns that are distributed. This concept is very similar to a relational database or a data frame of R but with much better optimizations. Source of these data frames could be a CSV, a TSV, Hive tables, local R data frames, and so on.

Spark distribution can be run using the ./bin/sparkR shell.

Following on from the preceding examples, we will now write an R version. We assume that you have R (R version 3.0.2 (2013-09-25)-Frisbee Sailing), R Studio and higher installed on your system (for example, most Linux and Mac OS X systems come with Python preinstalled).

The example program is included in the sample code for this chapter, in the directory named r-spark-app, which also contains the CSV data file under the data subdirectory. The project contains a script, r-script-01.R, which is provided in the following. Make sure you change PATH to appropriate value for your environment.

Sys.setenv(SPARK_HOME = "/PATH/spark-2.0.0-bin-hadoop2.7") 
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"),
.libPaths()))
#load the Sparkr library
library(SparkR)
sc <- sparkR.init(master = "local", sparkPackages="com.databricks:spark-csv_2.10:1.3.0")
sqlContext <- sparkRSQL.init(sc)

user.purchase.history <- "/PATH/ml-resources/spark-ml/Chapter_01/r-spark-app/data/UserPurchaseHistory.csv"
data <- read.df(sqlContext, user.purchase.history, "com.databricks.spark.csv", header="false")
head(data)
count(data)

parseFields <- function(record) {
Sys.setlocale("LC_ALL", "C") # necessary for strsplit() to work correctly
parts <- strsplit(as.character(record), ",")
list(name=parts[1], product=parts[2], price=parts[3])
}

parsedRDD <- SparkR:::lapply(data, parseFields)
cache(parsedRDD)
numPurchases <- count(parsedRDD)

sprintf("Number of Purchases : %d", numPurchases)
getName <- function(record){
record[1]
}

getPrice <- function(record){
record[3]
}

nameRDD <- SparkR:::lapply(parsedRDD, getName)
nameRDD = collect(nameRDD)
head(nameRDD)

uniqueUsers <- unique(nameRDD)
head(uniqueUsers)

priceRDD <- SparkR:::lapply(parsedRDD, function(x) { as.numeric(x$price[1])})
take(priceRDD,3)

totalRevenue <- SparkR:::reduce(priceRDD, "+")

sprintf("Total Revenue : %.2f", s)

products <- SparkR:::lapply(parsedRDD, function(x) { list( toString(x$product[1]), 1) })
take(products, 5)
productCount <- SparkR:::reduceByKey(products, "+", 2L)
productsCountAsKey <- SparkR:::lapply(productCount, function(x) { list( as.integer(x[2][1]), x[1][1])})

productCount <- count(productsCountAsKey)
mostPopular <- toString(collect(productsCountAsKey)[[productCount]][[2]])
sprintf("Most Popular Product : %s", mostPopular)

Run the script with the following command on the bash terminal:

  $ Rscript r-script-01.R 

Your output will be similar to the following listing:

> sprintf("Number of Purchases : %d", numPurchases)
[1] "Number of Purchases : 5"

> uniqueUsers <- unique(nameRDD)
> head(uniqueUsers)
[[1]]
[[1]]$name
[[1]]$name[[1]]
[1] "John"
[[2]]
[[2]]$name
[[2]]$name[[1]]
[1] "Jack"
[[3]]
[[3]]$name
[[3]]$name[[1]]
[1] "Jill"
[[4]]
[[4]]$name
[[4]]$name[[1]]
[1] "Bob"

> sprintf("Total Revenue : %.2f", totalRevenueNum)
[1] "Total Revenue : 39.91"

> sprintf("Most Popular Product : %s", mostPopular)
[1] "Most Popular Product : iPad Cover"
You have been reading a chapter from
Machine Learning with Spark - Second Edition
Published in: Apr 2017
Publisher: Packt
ISBN-13: 9781785889936
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image