
Apache Spark Machine Learning Blueprints: Develop a range of cutting-edge machine learning projects with Apache Spark using this actionable guide

Alex Liu
eBook May 2016 252 pages 1st Edition

Chapter 1. Spark for Machine Learning

This chapter provides an introduction to Apache Spark from a Machine Learning (ML) and data analytics perspective, and also discusses machine learning in relation to Spark computing. Here, we first present an overview of Apache Spark, as well as Spark's advantages for data analytics, in comparison to MapReduce and other computing platforms. Then we discuss five main issues, as below:

  • Machine learning algorithms and libraries
  • Spark RDD and dataframes
  • Machine learning frameworks
  • Spark pipelines
  • Spark notebooks

These are among the most important topics that any data scientist or machine learning professional is expected to master in order to take full advantage of Apache Spark computing. Specifically, this chapter covers the following six topics:

  • Spark overview and Spark advantages
  • ML algorithms and ML libraries for Spark
  • Spark RDD and dataframes
  • ML Frameworks, RM4Es and Spark computing
  • ML workflows and Spark pipelines
  • Spark notebooks introduction

Spark overview and Spark advantages

In this section, we provide an overview of the Apache Spark computing platform and a discussion about some advantages of utilizing Apache Spark, in comparison to using other computing platforms like MapReduce. Then, we briefly discuss how Spark computing fits modern machine learning and big data analytics.

After this section, readers will form a basic understanding of Apache Spark as well as a good understanding of some important machine learning benefits from utilizing Apache Spark.

Spark overview

Apache Spark is a computing framework for the fast processing of big data. This framework contains a distributed computing engine and a specially designed programming model. Spark was started as a research project at the AMPLab of the University of California at Berkeley in 2009. It was open sourced in 2010 and donated to the Apache Software Foundation in 2013. Since then, Apache Spark has grown rapidly, and it is now among the most active open source projects in the big data field.

Spark's computing utilizes an in-memory distributed computational approach, which makes Spark computing among the fastest, especially for iterative computation. It can run up to 100 times faster than Hadoop MapReduce, according to many tests that have been performed.

Apache Spark has a unified platform, which consists of the Spark core engine and four libraries: Spark SQL, Spark Streaming, MLlib, and GraphX. Spark SQL, Spark Streaming, and MLlib have Python, Java, and Scala programming APIs, while GraphX is programmed primarily through Scala.

Besides these four built-in libraries, there are also dozens of packages available for Apache Spark, provided by third parties, which can be used for handling data sources, machine learning, and other tasks.

[Figure: Spark overview]

Apache Spark has a three-month cycle for new releases, with Spark version 1.6.0 released on January 4, 2016. Apache Spark release 1.3 included the DataFrames API and the ML Pipelines API. Starting from Apache Spark release 1.4, the R interface (SparkR) is included by default.

Note

To download Apache Spark, readers should go to http://spark.apache.org/downloads.html.

To install Apache Spark and start running it, readers should consult its latest documentation at http://spark.apache.org/docs/latest/.

Spark advantages

Apache Spark has many advantages over MapReduce and other big data computing platforms. The two most notable are that it is fast to run and fast to write.

Overall, Apache Spark has kept some of MapReduce's most important advantages, such as scalability and fault tolerance, but has extended them greatly with new technologies.

In comparison to MapReduce, Apache Spark's engine is capable of executing a more general Directed Acyclic Graph (DAG) of operators. Therefore, when using Apache Spark to execute MapReduce-style graphs, users can achieve higher performance batch processing in Hadoop.

Apache Spark has in-memory processing capabilities, and uses a new data abstraction, the Resilient Distributed Dataset (RDD), which enables highly iterative computing and reactive applications. This also extends its fault tolerance capability.

At the same time, Apache Spark makes complex pipelines easy to represent, with only a few lines of code needed. It is best known for the ease with which it can be used to create algorithms that capture insight from complex and even messy data, and that enable users to apply that insight in time to drive outcomes.

As summarized by the Apache Spark team, Spark enables:

  • Iterative algorithms in Machine Learning
  • Interactive data mining and data processing
  • Hive-compatible data warehousing that can run 100x faster
  • Stream processing
  • Sensor data processing

To a practical data scientist working with the above, Apache Spark easily demonstrates its advantages when it is adopted for:

  • Parallel computing
  • Interactive analytics
  • Complex computation

Most users are satisfied with Apache Spark's advantages in speed and performance, but some also noted that Apache Spark is still in the process of maturing.

Note

http://www.svds.com/use-cases-for-apache-spark/ has some examples of materialized Spark benefits.

Spark computing for machine learning

With its innovations on RDD and in-memory processing, Apache Spark has made distributed computing easily accessible to data scientists and machine learning professionals. According to the Apache Spark team, Apache Spark runs on the Mesos cluster manager, letting it share resources with Hadoop and other applications. Apache Spark can also read from any Hadoop input source, such as HDFS.

[Figure: Spark computing for machine learning]

For these reasons, the Apache Spark computing model is well suited to distributed computing for machine learning. It is an especially strong choice for rapid interactive machine learning, parallel computing, and complicated modelling at scale.

According to the Spark development team, Spark's philosophy is to make life easy and productive for data scientists and machine learning professionals. Due to this, Apache Spark has:

  • Well documented, expressive APIs
  • Powerful domain specific libraries
  • Easy integration with storage systems
  • Caching to avoid data movement

Per the introduction by Patrick Wendell, co-founder of Databricks, Spark is especially made for large scale data processing. Apache Spark supports agile data science to iterate rapidly, and Spark can be integrated with IBM and other solutions easily.

Machine learning algorithms

In this section, we review algorithms that are needed for machine learning, and introduce machine learning libraries including Spark's MLlib and IBM's SystemML, then we discuss their integration with Apache Spark.

After reading this section, readers will become familiar with various machine learning libraries including Spark's MLlib, and know how to make them ready for machine learning.

To complete a machine learning project, data scientists often employ classification or regression algorithms to develop and evaluate predictive models. These algorithms are readily available in machine learning tools such as R or MATLAB. Besides data sets and computing platforms, such machine learning libraries, as collections of machine learning algorithms, are therefore necessary.

For example, the strength and depth of the popular R language mainly come from the various algorithms that are readily provided for the use of machine learning professionals. The total number of R packages is over 1000. Data scientists do not need all of them, but do need some packages to:

  • Load data, with packages like RODBC or RMySQL
  • Manipulate data, with packages like stringr or lubridate
  • Visualize data, with packages like ggplot2 or leaflet
  • Model data, with packages like randomForest or survival
  • Report results, with packages like shiny or markdown

According to a recent ComputerWorld survey, the most downloaded R packages are:

PACKAGE         # OF DOWNLOADS
Rcpp            162778
ggplot2         146008
plyr            123889
stringr         120387
colorspace      118798
digest          113899
reshape2        109869
RColorBrewer    100623
scales          92448
manipulate      88664

MLlib

MLlib is Apache Spark's machine learning library. It is scalable, and consists of many commonly used machine learning algorithms. Built into MLlib are algorithms for:

  • Handling data types in forms of vectors and matrices
  • Computing basic statistics like summary statistics and correlations, as well as producing simple random and stratified samples, and conducting simple hypothesis testing
  • Performing classification and regression modeling
  • Collaborative filtering
  • Clustering
  • Performing dimensionality reduction
  • Conducting feature extraction and transformation
  • Frequent pattern mining
  • Developing optimization
  • Exporting PMML models

Spark MLlib is still under active development, with new algorithms expected to be added in every new release.

In line with Apache Spark's computing philosophy, MLlib is built for easy use and deployment, with high performance.

MLlib uses the linear algebra package Breeze, which depends on netlib-java and jblas. The packages netlib-java and jblas in turn depend on native Fortran routines. Users need to install the gfortran runtime library if it is not already present on their nodes. MLlib will throw a linking error if it cannot detect these libraries automatically.

Note

For MLlib use cases and further details on how to use MLlib, please visit:

http://spark.apache.org/docs/latest/mllib-guide.html.

Other ML libraries

As discussed in the previous part, MLlib has made available many frequently used algorithms such as regression and classification. But these basics are not enough for complicated machine learning.

Waiting for the Apache Spark team to add all the needed ML algorithms may take a long time. The good news is that many third parties have contributed ML libraries to Apache Spark.

IBM has contributed its machine learning library, SystemML, to Apache Spark.

Besides what MLlib provides, SystemML offers many additional ML algorithms, such as those for missing data imputation, SVM, GLM, ARIMA, and non-linear optimizers, as well as some graphical modelling and matrix factorization algorithms.

Developed by the IBM Almaden Research group, IBM's SystemML is an engine for distributed machine learning that can scale to arbitrarily large data sizes. It provides the following benefits:

  • Unifies the fractured machine learning environments
  • Gives the core Spark ecosystem a complete set of DML
  • Allows a data scientist to focus on the algorithm, not the implementation
  • Improves time to value for data science teams
  • Establishes a de facto standard for reusable machine learning routines

SystemML is modeled after R syntax and semantics, and provides the ability to author new algorithms via its own language.

Through a good integration with R by SparkR, Apache Spark users also have the potential to utilize thousands of R packages for machine learning algorithms, when needed. As will be discussed in later sections of this chapter, the SparkR notebook will make this operation very easy.

Spark RDD and dataframes

In this section, our focus turns to data and how Apache Spark represents data and organizes data. Here, we will provide an introduction to the Apache Spark RDD and Apache Spark dataframes.

After this section, readers will master these two fundamental Spark concepts, RDD and Spark dataframe, and be ready to utilize them for Machine Learning projects.

Spark RDD

Apache Spark's primary data abstraction is in the form of a distributed collection of items, which is called Resilient Distributed Dataset (RDD). RDD is Apache Spark's key innovation, which makes its computing faster and more efficient than others.

Specifically, an RDD is an immutable collection of objects that is spread across a cluster. It is statically typed; for example, RDD[T] has objects of type T. There are RDDs of strings, RDDs of integers, and RDDs of objects.

In addition, RDDs:

  • Are collections of objects across a cluster with user controlled partitioning
  • Are built via parallel transformations like map and filter

That is, an RDD is physically distributed across a cluster, but manipulated as one logical entity. RDDs on Spark have fault tolerant properties such that they can be automatically rebuilt on failure.

New RDDs can be created from Hadoop Input Formats (such as HDFS files) or by transforming other RDDs.

To create RDDs, users can either:

  • Distribute a collection of objects from the driver program (using the parallelize method of the Spark context)
  • Load an external dataset
  • Transform an existing RDD

RDDs support two types of operations, which the Spark team calls actions and transformations.

RDDs can be operated on by actions, which return values, or by transformations, which return pointers to new RDDs. Some examples of RDD actions are collect, count, and take.

Transformations are lazily evaluated: Spark does not compute them until an action needs the result. Some examples of RDD transformations are map, filter, and join.

RDD actions and transformations may be combined to form complex computations.
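
The split between lazy transformations and eager actions can be modeled on a single machine in plain Python. The sketch below is only an illustration under stated assumptions: `MiniRDD` and `square` are hypothetical names invented here, the class is not Spark's API, and nothing in it is distributed or fault tolerant. It shows only that chaining map and filter does no work until an action such as collect is called.

```python
from itertools import islice
from typing import Callable, Iterable, List


class MiniRDD:
    """Toy single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source: Callable[[], Iterable]):
        # Keep a recipe for producing the data rather than the data itself.
        self._source = source

    @classmethod
    def parallelize(cls, items: Iterable) -> "MiniRDD":
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new MiniRDD and compute nothing yet.
    def map(self, fn: Callable) -> "MiniRDD":
        return MiniRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable) -> "MiniRDD":
        return MiniRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the chained recipes and return plain Python values.
    def collect(self) -> List:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())

    def take(self, n: int) -> List:
        return list(islice(self._source(), n))


calls = []  # records each time the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = MiniRDD.parallelize(range(1, 6))
evens_squared = numbers.map(square).filter(lambda v: v % 2 == 0)

calls_before_action = len(calls)  # still 0: only recipes have been built
result = evens_squared.collect()  # the action triggers the computation
calls_after_action = len(calls)

print(calls_before_action, result, calls_after_action)
```

Because each `MiniRDD` stores only a recipe, calling an action a second time recomputes the chain from the source. That is loosely analogous to how an RDD can be rebuilt after a failure, although real Spark tracks this across a cluster.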

Note

To learn more about RDD, please read the article at

https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Spark dataframes

A Spark dataframe is a distributed collection of data organized into named columns, that is, an RDD with a schema. In other words, a Spark dataframe is an extension of a Spark RDD.

Data frame = RDD where columns are named and can be manipulated by name instead of by index value.

A Spark dataframe is conceptually equivalent to a dataframe in R, and is similar to a table in a relational database, which helped Apache Spark to be quickly accepted by the machine learning community. With Spark dataframes, users can work directly with data elements such as columns, which are not available when working with RDDs. With knowledge of the data schema on hand, users can also apply their familiar SQL-style data re-organization techniques to the data. Spark dataframes can be built from many kinds of raw data, such as structured relational data files, Hive tables, or existing RDDs.

Apache Spark has built a special dataframe API and a Spark SQL to deal with Spark dataframes. The Spark SQL and Spark dataframe API are both available for Scala, Java, Python, and R. As an extension to the existing RDD API, the DataFrames API features:

  • Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
  • Support for a wide array of data formats and storage systems
  • State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
  • Seamless integration with all big data tooling and infrastructure via Spark

Spark SQL works well with Spark DataFrames, which allows users to do ETL easily and to work on subsets of any data easily. Users can then transform these subsets and make them available to other users, including R users. Spark SQL can also be used alongside HiveQL, and runs very fast. With Spark SQL, users also write less code: a lot less than when working with Hadoop, and less than when working directly on RDDs.

Dataframes API for R

A dataframe is an essential element for machine learning programming. Apache Spark has made a dataframe API available for R as well as for Java and Python, so that users can operate Spark dataframes easily in their familiar environment with their familiar language. In this section, we provide a simple introduction to operating Spark dataframes, with some simple examples for R to start leading our readers into actions.

The entry point into all relational functionality in Apache Spark is its SQLContext class, or one of its descendants. To create a basic SQLContext, all users need is a SparkContext, as below:

sqlContext <- sparkRSQL.init(sc)

To create a Spark dataframe, users may perform the following:

sqlContext <- sparkRSQL.init(sc)
df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
showDF(df)

For Spark dataframe operations, the following are some examples:

sqlContext <- sparkRSQL.init(sc)
# Create the DataFrame
df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
# Show the content of the DataFrame
showDF(df)
## age  name
## null Michael
## 30   Andy
## 19   Justin

# Print the schema in a tree format
printSchema(df)
## root
## |-- age: long (nullable = true)
## |-- name: string (nullable = true)

# Select only the "name" column
showDF(select(df, "name"))
## name
## Michael
## Andy
## Justin

# Select everybody, but increment the age by 1
showDF(select(df, df$name, df$age + 1))
## name    (age + 1)
## Michael null
## Andy    31
## Justin  20

# Select people older than 21
showDF(where(df, df$age > 21))
## age name
## 30  Andy

# Count people by age
showDF(count(groupBy(df, "age")))
## age  count
## null 1
## 19   1
## 30   1
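
To see what named columns add over positional access, the same three records can be modeled in plain Python. This is a sketch under stated assumptions, not Spark code: the rows are typed in by hand to mirror the people.json output shown above, and the variable names are invented for illustration.

```python
from collections import Counter

# Sample rows mirroring the people.json output above (typed in by hand).
rows = [(None, "Michael"), (30, "Andy"), (19, "Justin")]  # RDD-style tuples
schema = ("age", "name")  # the schema: one name per column

# Dataframe-style view: every record can be addressed by column name.
records = [dict(zip(schema, row)) for row in rows]

# Without a schema, the caller has to know that the name sits at position 1.
names_by_index = [row[1] for row in rows]
# With a schema, the column name is enough.
names_by_name = [rec["name"] for rec in records]

# Counterpart of where(df, df$age > 21): rows with a missing age are dropped.
older_than_21 = [
    rec for rec in records if rec["age"] is not None and rec["age"] > 21
]

# Counterpart of count(groupBy(df, "age")).
count_by_age = Counter(rec["age"] for rec in records)

print(names_by_name, older_than_21, dict(count_by_age))
```

The filter and the grouped count refer to `age` by name only, which is the convenience that the dataframe API provides over a plain RDD of tuples.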

ML frameworks, RM4Es and Spark computing

In this section, we discuss machine learning frameworks, with RM4Es as one example, in relation to Apache Spark computing.

After this section, readers will master the concept of machine learning frameworks and some examples, and then be ready to combine them with Spark computing for planning and implementing machine learning projects.

ML frameworks

As discussed in earlier sections, Apache Spark computing is very different from Hadoop MapReduce. Spark is faster and easier to use than Hadoop MapReduce. There are many benefits to adopting Apache Spark computing for machine learning.

However, all the benefits for machine learning professionals will materialize only if Apache Spark can enable good ML frameworks. Here, an ML framework means a system or an approach that combines all the ML elements, including ML algorithms, to make ML most effective for its users. Specifically, it refers to the ways that data is represented and processed, how predictive models are represented and estimated, and how modeling results are evaluated and utilized. From this perspective, ML frameworks differ from each other in how they handle data sources, conduct data pre-processing, implement algorithms, and support complex computation.

There are many ML frameworks, as there are also various computing platforms supporting these frameworks. Among the available ML frameworks, the frameworks stressing iterative computing and interactive manipulation are considered among the best, because these features can facilitate complex predictive model estimation and good researcher-data interaction. Nowadays, good ML frameworks also need to cover big data capabilities or fast processing at scale, as well as fault tolerance capabilities. Good frameworks always include a large number of machine learning algorithms and statistical tests ready to be used.

As mentioned in previous sections, Apache Spark has excellent iterative computing performance and is highly cost-effective, thanks to in-memory data processing. It is compatible with all of Hadoop's data sources and file formats and, thanks to friendly APIs that are available in several languages, it also has a gentler learning curve. Apache Spark even includes graph processing and machine learning capabilities. For these reasons, Apache Spark based ML frameworks are favored by ML professionals.

However, Hadoop MapReduce is a more mature platform and it was built for batch processing. It can be more cost-effective than Spark, for some big data that doesn't fit in memory and also due to the greater availability of experienced staff. Furthermore, the Hadoop MapReduce ecosystem is currently bigger thanks to many supporting projects, tools, and cloud services.

But even if Spark looks like the big winner, the chances are that ML professionals won't use it on its own. They may still need HDFS to store the data and may want to use HBase, Hive, Pig, Impala, or other Hadoop projects. In many cases, this means ML professionals still need to run Hadoop and MapReduce alongside Apache Spark for a full big data package.

RM4Es

In a previous section, we had some general discussion about machine learning frameworks. Specifically, an ML framework covers how to deal with data, analytical methods, analytical computing, results evaluation, and results utilization, which RM4Es represents nicely as a framework. The RM4Es (Research Methods Four Elements) is a good framework to summarize machine learning components and processes. The RM4Es include:

  • Equation: Equations are used to represent the models for our research
  • Estimation: Estimation is the link between equations (models) and the data used for our research
  • Evaluation: Evaluation needs to be performed to assess the fit between models and the data
  • Explanation: Explanation is the link between equations (models) and our research purposes. How we explain our research results often depends on our research purposes and also on the subject we are studying

The RM4Es are the key four aspects that distinguish one machine learning method from another. The RM4Es are sufficient to represent an ML status at any given moment. Furthermore, using RM4Es can easily and sufficiently represent ML workflows.

Related to what we have discussed so far, Equation corresponds to ML libraries, Estimation represents how computing is done, and Evaluation is about how to tell whether an ML model is better and, for iterative computing, whether we should continue or stop. Explanation is also a key part of ML, as our goal is to turn data into insightful results that can be used.

Per the above, a good ML framework needs to deal with data abstraction and data pre-processing at scale, and also needs to deal with fast computing, interactive evaluation at scale and speed, as well as easy results interpretation and deployment.
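
As a small worked illustration of the four elements, the sketch below walks a simple linear regression through Equation, Estimation, Evaluation, and Explanation in plain Python. The data values are hypothetical and invented for this example, and ordinary least squares with R-squared is just one convenient choice of estimation and evaluation method, not something the RM4Es framework prescribes.

```python
from statistics import mean

# Hypothetical toy data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


# Equation: the model is y = intercept + slope * x.
def predict(intercept: float, slope: float, xi: float) -> float:
    return intercept + slope * xi


# Estimation: ordinary least squares links the equation to the data.
x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Evaluation: R-squared assesses the fit between the model and the data.
ss_res = sum((yi - predict(intercept, slope, xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

# Explanation: relate the estimated parameters back to the research question.
print(
    f"Each unit increase in x is associated with a change of {slope:.2f} in y "
    f"(R-squared = {r_squared:.3f})."
)
```

In an iterative setting, the evaluation step is also where one would decide whether to stop or to go back and re-estimate with a different equation.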

The Spark computing framework

Earlier in the chapter, we discussed how Spark computing supports iterative ML computing. After reviewing machine learning frameworks and how Spark computing relates to ML frameworks, we are ready to understand more about why Spark computing should be selected for ML.

Spark was built to serve ML and data science, to make ML at scale and ML deployment easy. As discussed, Spark's core innovation on RDDs enables fast and easy computing, with good fault tolerance.

Spark is a general computing platform, and a Spark application consists of two kinds of programs: a driver program and worker programs.

To program, developers need to write a driver program that implements the high-level control flow of their application and also launches various operations in parallel. All the worker programs developed will run on cluster nodes or in local threads, and RDDs operate across all workers.

As mentioned, Spark provides two main abstractions for parallel programming: resilient distributed datasets and parallel operations on these datasets (invoked by passing a function to apply on a dataset).

In addition, Spark supports two restricted types of shared variables:

  • Broadcast variables: If a large read-only piece of data (e.g., a lookup table) is used in multiple parallel operations, it is preferable to distribute it to the workers only once instead of packaging it with every closure.
  • Accumulators: These are variables that workers can only add to using an associative operation, and that only the driver can read. They can be used to implement counters as in MapReduce and to provide a more imperative syntax for parallel sums. Accumulators can be defined for any type that has an add operation and a zero value. Due to their add-only semantics, they are easy to make fault-tolerant.
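
The roles of the two shared variable types can be sketched in plain Python. This is a single-process model under stated assumptions: `Accumulator`, `process_partition`, and the lookup table are hypothetical names and data made up for illustration, not Spark classes, and the "partitions" are ordinary lists processed one after another rather than in parallel.

```python
from typing import Dict, List

# Hypothetical read-only lookup table. In Spark this would be sent to each
# worker once as a broadcast variable instead of travelling with every closure.
COUNTRY_NAMES: Dict[str, str] = {
    "MX": "Mexico",
    "US": "United States",
    "CA": "Canada",
}


class Accumulator:
    """Toy add-only counter (hypothetical, not Spark's accumulator class)."""

    def __init__(self, zero: int = 0):
        self._value = zero

    def add(self, amount: int) -> None:
        # Workers may only add; they never read the running total.
        self._value += amount

    @property
    def value(self) -> int:
        # Only the driver reads the final value.
        return self._value


def process_partition(
    codes: List[str], lookup: Dict[str, str], unknown: Accumulator
) -> List[str]:
    """Worker-side logic: resolve codes and count the ones that are missing."""
    resolved_names = []
    for code in codes:
        if code in lookup:
            resolved_names.append(lookup[code])
        else:
            unknown.add(1)
    return resolved_names


# Driver-side logic: hand each partition to the worker function.
partitions = [["MX", "US"], ["XX", "CA"], ["US", "??"]]
unknown_codes = Accumulator()
resolved = [
    name
    for part in partitions
    for name in process_partition(part, COUNTRY_NAMES, unknown_codes)
]

print(resolved, unknown_codes.value)
```

Because addition is associative, the per-partition contributions could be combined in any order, which is what makes such counters easy to keep correct when a failed task is rerun.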

With all the above, the Apache Spark computing framework is capable of supporting various machine learning frameworks that need fast parallel computing with fault tolerance.

ML workflows and Spark pipelines

In this section, we provide an introduction to machine learning workflows and Spark pipelines, and then discuss how Spark pipelines can serve as a good tool for computing ML workflows.

After this section, readers will master these two important concepts, and be ready to program and implement Spark pipelines for machine learning workflows.

ML as a step-by-step workflow

Almost all ML projects involve cleaning data, developing features, estimating models, evaluating models, and then interpreting results, which all can be organized into some step by step workflows. These workflows are sometimes called analytical processes.

Some people even define machine learning as workflows of turning data into actionable insights, for which some people will add business understanding or problem definition into the workflows as their starting points.

In the data mining field, the Cross Industry Standard Process for Data Mining (CRISP-DM) is a widely accepted workflow standard that is still widely adopted. Many standard ML workflows are simply revisions of the CRISP-DM workflow.

[Figure: the CRISP-DM workflow]

As illustrated in the above picture, for any standard CRISP-DM workflow, we need all the following 6 steps:

  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modeling
  5. Evaluation
  6. Deployment

Some people may add analytical approach selection and results explanation to these steps to make the workflow more complete. For complicated machine learning projects, branches and feedback loops can make workflows very complex.

In other words, for some machine learning projects, after we complete model evaluation, we may go back to the step of modeling or even data preparation. After the data preparation step, we may branch out for more than two types of modeling.

ML workflow examples

To further understand machine learning workflows, let us review some examples here.

In the later chapters of this book, we will work on risk modelling, fraud detection, customer view, churn prediction, and recommendation. For many of these types of projects, the goal is often to identify causes of certain problems, or to build a causal model. Below is one example of a workflow to develop a causal model.

  1. Check data structure to ensure a good understanding of the data:
    • Is the data cross-sectional? Is implicit timing incorporated?
    • Are categorical variables used?
  2. Check missing values:
    • "Don't know" or "forget" as an answer may be recoded as neutral or treated as a special category
    • Some variables may have a lot of missing values
    • Recode some variables as needed
  3. Conduct some descriptive studies to begin telling stories:
    • Use comparing means and crosstabulations
    • Check variability of some key variables (standard deviation and variance)
  4. Select groups of independent variables (exogenous variables):
    • As candidates of causes
  5. Basic descriptive statistics:
    • Mean, standard deviation, and frequencies for all variables
  6. Measurement work:
    • Study dimensions of some measurements (exploratory factor analysis, EFA, may be useful here)
    • May form measurement models
  7. Local models:
    • Identify sections out from the whole picture to explore relationship
    • Use crosstabulations
    • Graphical plots
    • Use logistic regression
    • Use linear regression
  8. Conduct some partial correlation analysis to help model specification.
  9. Propose structural equation models by using the results of (8):
    • Identify main structures and sub structures
    • Connect measurements with structure models
  10. Initial fits:
    • Use SPSS to create data sets for LISREL or Mplus
    • Programming in LISREL or Mplus
  11. Model modification:
    • Use SEM results (mainly model fit indices) to guide
    • Re-analyze partial correlations
  12. Diagnostics:
    • Distribution
    • Residuals
    • Curves
  13. Final model estimation may be reached here:
    • If not, repeat steps 11 and 12
  14. Explaining the model (causal effects identified and quantified).

Note

Also refer to http://www.researchmethods.org/step-by-step1.pdf.

Spark Pipelines

The Apache Spark team has recognized the importance of machine learning workflows and they have developed Spark Pipelines to enable good handling of them.

Spark ML represents a ML workflow as a pipeline, which consists of a sequence of PipelineStages to be run in a specific order.

PipelineStages include Spark Transformers and Spark Estimators. Spark Evaluators work alongside a pipeline to evaluate the models it produces.

ML workflows can be very complicated, so that creating and tuning them is very time consuming. The Spark ML Pipeline was created to make the construction and tuning of ML workflows easy, and especially to represent the following main stages:

  1. Loading data
  2. Extracting features
  3. Estimating models
  4. Evaluating models
  5. Explaining models

With regards to the above tasks, Spark Transformers can be used to extract features. Spark Estimators can be used to train and estimate models, and Spark Evaluators can be used to evaluate models.

Technically, in Spark, a Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input dataset is modified as it passes through each stage. For Transformer stages, the transform() method is called on the dataset. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is called on the dataset.

The specifications given above are all for linear Pipelines. It is possible to create non-linear Pipelines as long as the data flow graph forms a Directed Acyclic Graph (DAG).
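
The fit() and transform() mechanics of a linear pipeline can be sketched in plain Python. This is a minimal model under stated assumptions, not the Spark ML API: the class names mirror the concepts described here, but `DropMissing`, `MeanCenter`, and `Centerer` are hypothetical stages invented for illustration, and the data is an ordinary list rather than a distributed dataset.

```python
from statistics import mean
from typing import List, Union


class Transformer:
    """A stage that maps one dataset to another."""

    def transform(self, data: List) -> List:
        raise NotImplementedError


class Estimator:
    """A stage that learns from a dataset and returns a fitted Transformer."""

    def fit(self, data: List) -> Transformer:
        raise NotImplementedError


class DropMissing(Transformer):
    """Hypothetical stage: removes missing (None) values."""

    def transform(self, data):
        return [v for v in data if v is not None]


class Centerer(Transformer):
    """Hypothetical fitted stage: subtracts a previously learned center."""

    def __init__(self, center: float):
        self.center = center

    def transform(self, data):
        return [v - self.center for v in data]


class MeanCenter(Estimator):
    """Hypothetical estimator: learns the mean of the training data."""

    def fit(self, data):
        return Centerer(mean(data))


class PipelineModel(Transformer):
    """A fitted pipeline: every stage is now a Transformer."""

    def __init__(self, stages: List[Transformer]):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


class Pipeline(Estimator):
    """An ordered sequence of Transformer and Estimator stages."""

    def __init__(self, stages: List[Union[Transformer, Estimator]]):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(data)  # Estimator becomes a Transformer
            data = stage.transform(data)  # pass the result to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


training = [1.0, None, 2.0, 3.0]
model = Pipeline([DropMissing(), MeanCenter()]).fit(training)
scored = model.transform([None, 4.0, 2.0])
print(model.stages[1].center, scored)
```

Fitting the pipeline returns a model made only of Transformers, so the mean learned from the training data is reused unchanged when new data is scored.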

Note

For more info on Spark pipeline, please visit:

http://spark.apache.org/docs/latest/ml-guide.html#pipeline

Spark notebooks

In this section, we first discuss notebook approaches for machine learning. Then we provide a full introduction to R Markdown as a mature notebook example, and then introduce Spark's R notebook to complete this section.

After this section, readers will master these notebook approaches as well as some related concepts, and be ready to use them for managing and programming machine learning projects.

Notebook approach for ML

The notebook has become a favored approach to machine learning, not only for its dynamic nature, but also for its reproducibility.

Most notebook interfaces consist of a series of code blocks, called cells. Development is exploratory: a developer writes and runs code in one cell, and then continues in a subsequent cell depending on the results from the first. Particularly when analyzing large datasets, this interactive approach allows machine learning professionals to quickly discover patterns or insights in the data. Notebook-style development thus provides an exploratory and interactive way to write code and immediately examine results.

Notebooks allow users to seamlessly mix code, outputs, and markdown comments in the same document. Having everything in one document makes it easier for machine learning professionals to reproduce their work at a later stage.
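As a rough illustration of the cell model, the following plain-Python sketch runs a list of cells in order, shares state between code cells, and collects comments, code, and output into one document. It is a toy, not how any real notebook is implemented; the cell contents and the run_notebook helper are hypothetical.

```python
import contextlib
import io

# Hypothetical notebook: a list of (cell type, source) pairs.
cells = [
    ("markdown", "Load a small sample and count its values."),
    ("code", "values = [3, 5, 8]\nprint(len(values))"),
    ("markdown", "The next cell reuses `values` from the previous one."),
    ("code", "print(sum(values) / len(values))"),
]


def run_notebook(cells):
    """Run cells in order and return one document with code and outputs."""
    namespace = {}  # shared state, so later cells can see earlier results
    parts = []
    for kind, source in cells:
        if kind == "markdown":
            parts.append(source)
            continue
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, namespace)
        parts.append("In:  " + source.replace("\n", "\n     "))
        parts.append("Out: " + buffer.getvalue().strip())
    return "\n".join(parts)


report = run_notebook(cells)
print(report)
```

Because the rendered report is regenerated from the cells each time, rerunning the notebook reproduces both the computation and its presentation, which is the reproducibility benefit described above.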

This notebook approach was adopted to ensure reproducibility, to align analysis with computation, and to align analysis with presentation, and so to end the copy-and-paste way of research management.

Specifically, using a notebook allows users to:

  • Analyze iteratively
  • Report transparently
  • Collaborate seamlessly
  • Compute with clarity
  • Assess reasoning, not only results

The notebook approach also provides a unified way to integrate many analytical tools for machine learning practice.

Note

For more about adopting an approach for reproducibility, please visit http://chance.amstat.org/2014/09/reproducible-paradigm/

R Markdown

R Markdown is a very popular tool that helps data scientists and machine learning professionals generate dynamic reports and make their analytical workflows reproducible. It is one of the pioneering notebook tools.

According to RStudio:

"R Markdown is a format that enables easy authoring of reproducible web reports from R. It combines the core syntax of Markdown (an easy-to-write plain text format for web content) with embedded R code chunks that are run so their output can be included in the final document".

Therefore, we can use R and the Markdown package, plus some dependent packages such as knitr, to author reproducible analytical reports. Using RStudio together with the Markdown package makes this even easier for data scientists.

Using Markdown is very easy for R users. As an example, let us create a report in the following three simple steps:

Step 1: Getting the software ready

  1. Download RStudio at: http://rstudio.org/
  2. Set options for RStudio: go to Tools > Options, click on Sweave, and choose knitr for the Weave Rnw files using option.

Step 2: Installing the knitr package

  1. To install a package in RStudio, use Tools > Install Packages, and then select a CRAN mirror and the package to install. Another way to install packages is to use the install.packages() function.
  2. To install the knitr package from the Carnegie Mellon StatLib CRAN mirror, we can use: install.packages("knitr", repos = "http://lib.stat.cmu.edu/R/CRAN/")

Step 3: Creating a simple report

  1. Create a blank R Markdown file: File > New > R Markdown. This opens a new .Rmd file.
  2. When you create the blank file, you will see a pre-written template. A simple way to proceed is to replace the corresponding parts with your own information.
  3. After all your information is entered, click Knit HTML.
  4. Now you will see that you have generated an .html file.

Spark notebooks

There are a few notebooks compatible with Apache Spark computing. Among them, the Databricks notebook is one of the best, as it was developed by the original Spark team. It is similar to R Markdown, but is seamlessly integrated with Apache Spark.

Besides SQL, Python, and Scala, the Databricks notebook is now also available for R, and Spark 1.4 includes the SparkR package by default. That is, data scientists and machine learning professionals can now benefit from the power of Apache Spark in their R environment by writing and running R notebooks on top of Spark.

In addition to SparkR, any R package can be easily installed into the Databricks R notebook by using install.packages(). So, with the Databricks R notebook, data scientists and machine learning professionals get the power of R Markdown on top of Spark. By using SparkR, they can access and manipulate very large datasets (for example, terabytes of data) from distributed storage (for example, Amazon S3) or data warehouses (for example, Hive). They can even collect a SparkR DataFrame into a local R data frame.

Visualization is a critical part of any machine learning project. In R Notebooks, data scientists and machine learning professionals can use any R visualization library, including R's base plotting, ggplot, or Lattice. As in R Markdown, plots are displayed inline in the R notebook. Users can apply Databricks' built-in display() function to any R data frame or SparkR DataFrame. The result will appear as a table in the notebook, which can then be plotted with one click. Similar to other Databricks notebooks, such as the Python notebook, data scientists can also use the displayHTML() function in R notebooks to produce any HTML and JavaScript visualization.

Databricks' end-to-end solution also makes building a machine learning pipeline easy from ingest to production, which applies to R Notebooks as well: data scientists can schedule their R notebooks to run as jobs on Spark clusters. The results of each job, including visualizations, are immediately available to browse, making it much simpler and faster to turn the work into production.

To sum up, R Notebooks in Databricks let R users take advantage of the power of Spark through simple Spark cluster management, rich one-click visualizations, and instant deployment to production jobs. Databricks also offers a 30-day free trial.

Summary

This chapter covers all the basics of Apache Spark, which all machine learning professionals are expected to understand in order to utilize Apache Spark for practical machine learning projects. We focus our discussion on Apache Spark computing, and relate it to some of the most important machine learning components, in order to connect Apache Spark and machine learning together to fully prepare our readers for machine learning projects.

First, we provided a Spark overview, and also discussed Spark's advantages as well as Spark's computing model for machine learning.

Second, we reviewed machine learning algorithms, Spark's MLlib libraries, and other machine learning libraries.

Third, we discussed Spark's core innovations of RDD and DataFrame, as well as Spark's DataFrame API for R.

Fourth, we reviewed some ML frameworks, specifically discussed the RM4Es framework for machine learning as an example, and then further discussed Spark computing frameworks for machine learning.

Fifth, we discussed machine learning as workflows, went through one workflow example, and then reviewed Spark's pipelines and its API.

Finally, we studied the notebook approach for machine learning and reviewed R Markdown, R's well-known notebook tool. We then discussed the Spark notebook provided by Databricks, which we can use to easily unite all of the above Spark elements for machine learning practice.

With all the above Spark basics covered, readers should be ready to start utilizing Apache Spark for machine learning projects from here on. We will work on data preparation on Spark in the next chapter, and then jump into our first real-life machine learning projects in Chapter 3, A Holistic View on Spark.


Key benefits

  • Customize Apache Spark and R to fit your analytical needs in customer research, fraud detection, risk analytics, and recommendation engine development
  • Develop a set of practical Machine Learning applications that can be implemented in real-life projects
  • A comprehensive, project-based guide to improve and refine your predictive models for practical implementation

Description

There's a reason why Apache Spark has become one of the most popular tools in Machine Learning – its ability to handle huge datasets at an impressive speed means you can be much more responsive to the data at your disposal. This book shows you Spark at its very best, demonstrating how to connect it with R and unlock maximum value not only from the tool but also from your data. Packed with a range of project "blueprints" that demonstrate some of the most interesting challenges that Spark can help you tackle, you'll find out how to use Spark notebooks and access, clean, and join different datasets before putting your knowledge into practice with some real-world projects, in which you will see how Spark Machine Learning can help you with everything from fraud detection to analyzing customer attrition. You'll also find out how to build a recommendation engine using Spark's parallel computing powers.

Who is this book for?

If you are a data scientist, a data analyst, or an R and SPSS user with a good understanding of machine learning concepts, algorithms, and techniques, then this is the book for you. Some basic understanding of Spark and its core elements and application is required.

What you will learn

  • Set up Apache Spark for machine learning and discover its impressive processing power
  • Combine Spark and R to unlock detailed business insights essential for decision making
  • Build machine learning systems with Spark that can detect fraud and analyze financial risks
  • Build predictive models focusing on customer scoring and service ranking
  • Build recommendation systems using SPSS on Apache Spark
  • Tackle parallel computing and find out how it can support your machine learning projects
  • Turn open data and communication data into actionable insights by making use of various forms of machine learning

Product Details

Publication date: May 30, 2016
Length: 252 pages
Edition: 1st
Language: English
ISBN-13: 9781785887789
Vendor: Apache

Table of Contents

12 Chapters
1. Spark for Machine Learning
2. Data Preparation for Spark ML
3. A Holistic View on Spark
4. Fraud Detection on Spark
5. Risk Scoring on Spark
6. Churn Prediction on Spark
7. Recommendations on Spark
8. Learning Analytics on Spark
9. City Analytics on Spark
10. Learning Telco Data on Spark
11. Modeling Open Data on Spark
Index

Customer reviews

Rating: 1 out of 5 (1 rating)

Sven, Feb 17, 2017 (1 out of 5, Amazon verified review)
You will find much better material directly in the Spark documentation or on Google. This book adds no value, whether in form, structure, content, use cases, or originality.
