You're reading from Spark Cookbook

Product type Book

Published in Jul 2015

Publisher

ISBN-13 9781783987061

Pages 226 pages

Edition 1st Edition

Languages

Concepts

Data Analysis

Author (1):

Rishi Yadav

Table of Contents (19) Chapters

Spark Cookbook

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

1. Getting Started with Apache Spark

2. Developing Applications with Spark

3. External Data Sources

4. Spark SQL

5. Spark Streaming

6. Getting Started with Machine Learning Using MLlib

7. Supervised Learning with MLlib – Regression

8. Supervised Learning with MLlib – Classification

9. Unsupervised Learning with MLlib

10. Recommender Systems

11. Graph Processing Using GraphX

12. Optimizations and Performance Tuning

Index

A

Alternating Least Squares (ALS)
- about / Collaborative filtering using explicit feedback
Amazon EC2
- about / Launching Spark on Amazon EC2
- features / Launching Spark on Amazon EC2
- Spark, launching / Launching Spark on Amazon EC2, Getting ready, How to do it...
- URL / Getting ready
Amazon Elastic Block Storage (EBS)
- about / Loading data from Amazon S3
Amazon Elastic Compute Cloud (EC2)
- about / Loading data from Amazon S3
Amazon S3
- data, loading / Loading data from Amazon S3, How to do it...
- about / Loading data from Amazon S3
- URL / Getting ready
Amazon Web Services (AWS)
- about / Loading data from Amazon S3
- URL / How to do it...
Apache Cassandra
- about / Loading data from Apache Cassandra
- data, loading / Loading data from Apache Cassandra, How to do it..., There's more...
arbitrary source
- data, saving / Loading and saving data from an arbitrary source, How to do it...
- data, loading / Loading and saving data from an arbitrary source, How to do it...

B

batch interval
- about / Introduction
bias
- versus variance / Doing linear regression with lasso
- about / Doing linear regression with lasso
binaries
- Spark, installing / Getting ready, How to do it...
binary classification
- performing, with SVM / Doing binary classification using SVM, How to do it…
bivariate analysis
- about / Introduction
broker
- about / Streaming using Kafka

C

case classes
- used, for inferring schema / Inferring schema using case classes, How to do it...
Catalyst optimizer
- about / Understanding the Catalyst optimizer
- goals / How it works…
- using, in analysis phase / Analysis
- using, in logical plan optimization phase / Logical plan optimization
- using, in physical planning phase / Physical planning
- using, in code generation phase / Code generation
classification
- about / Introduction
- performing, with logistic regression / Doing classification using logistic regression, Getting ready, How to do it…
- performing, with decision trees / Doing classification using decision trees, Getting ready, How to do it…, How it works…
- performing, with Random Forests / Doing classification using Random Forests, Getting ready, How to do it…, How it works…
- performing, with Gradient Boosted Trees / Doing classification using Gradient Boosted Trees, How to do it…
- performing, with Naïve Bayes / Doing classification with Naïve Bayes, How to do it…
cluster centroids
- about / Clustering using k-means
clustering
- about / Introduction, Clustering using k-means
- k-means algorithm, using / Clustering using k-means, Getting ready, How to do it…
collaborative filtering
- about / Collaborative filtering using explicit feedback
- explicit feedback, using / Collaborative filtering using explicit feedback, Getting ready, How to do it…
- implicit feedback, using / Collaborative filtering using implicit feedback, Getting ready, How it works…, There's more…
comma-separated values (CSV) file
- about / Getting ready
complex event processing (CEP)
- about / Introduction
compression
- about / Using compression to improve performance
- used, for performance improvement / Using compression to improve performance
concurrent mark and sweep (CMS)
- about / Optimizing memory
connected component
- searching / Finding connected components, Getting ready, How to do it…
Connector/J
- URL / How to do it...
connector library
- about / There's more...
consumers
- about / Streaming using Kafka
correlation
- about / Calculating correlation
- calculating / Calculating correlation, Getting ready, How to do it…
- positive correlation / Calculating correlation
- negative correlation / Calculating correlation
cost function
- about / Understanding cost function
- analyzing, for linear regression / Understanding cost function
custom InputFormat
- used, for loading data from HDFS / Loading data from HDFS using a custom InputFormat, How to do it...

D

data
- loading, from local filesystem / Loading data from the local filesystem, How to do it...
- loading, from HDFS / Loading data from HDFS, How to do it..., There's more…
- loading from HDFS, custom InputFormat used / Loading data from HDFS using a custom InputFormat, How to do it...
- loading, from Amazon S3 / Loading data from Amazon S3, How to do it...
- loading, from Apache Cassandra / Loading data from Apache Cassandra, How to do it..., There's more...
- loading, from relational databases / Loading data from relational databases, How to do it..., How it works…, Loading and saving data from relational databases, How to do it...
- loading, in Parquet format / Loading and saving data using the Parquet format, How to do it..., How it works…, There's more…
- saving, in Parquet format / Loading and saving data using the Parquet format, How to do it..., How it works…, There's more…
- loading, in JSON format / Loading and saving data using the JSON format, How to do it..., How it works…
- saving, in JSON format / Loading and saving data using the JSON format, How to do it..., How it works…
- saving, to relational databases / Loading and saving data from relational databases, How to do it...
- loading, from arbitrary source / Loading and saving data from an arbitrary source, How to do it...
- saving, to arbitrary source / Loading and saving data from an arbitrary source, How to do it...
DataFrame
- about / Introduction
data rate
- about / Introduction
data source API
- URL / There's more…
decision trees
- classification, performing / Doing classification using decision trees, Getting ready, How to do it…, How it works…
dimensionality reduction
- about / Dimensionality reduction with principal component analysis
- purposes / Dimensionality reduction with principal component analysis
- with principal component analysis (PCA) / Dimensionality reduction with principal component analysis, Getting ready, How to do it…
- with singular value decomposition (SVD) / Dimensionality reduction with singular value decomposition, Getting ready, How to do it…
directed graph
- about / Introduction
directories
- ephemeral-hdfs / How to do it...
- persistent-hdfs / How to do it...
- hadoop-native / How to do it...
- Scala / How to do it...
- Shark / How to do it...
- Spark / How to do it...
- spark-ec2 / How to do it...
- Tachyon / How to do it...
Discretized Stream (DStream)
- about / Introduction
distributed graph processing
- data parallel / Introduction
- graph parallel / Introduction
distributed matrix
- about / Creating matrices
- RowMatrix / Creating matrices
- IndexedRowMatrix / Creating matrices
- CoordinateMatrix / Creating matrices
domain-specific language (DSL)
- about / Introduction

E

Eclipse
- Spark application, developing with Maven / Developing Spark applications in Eclipse with Maven, How to do it...
- URL / Getting ready
- Spark application, developing with SBT / Developing Spark applications in Eclipse with SBT, How to do it...
Eden
- about / Optimizing memory
ensemble learning algorithms
- about / Doing classification using Random Forests
Estimator
- about / Getting ready
explicit feedback
- used, for collaborative filtering / Collaborative filtering using explicit feedback, Getting ready, How to do it…

F

fat-free XML
- about / Loading and saving data using the JSON format
features, vectors
- about / Creating vectors
feature scaling
- about / Getting ready
- performing / Getting ready

G

garbage-first GC (G1)
- about / Optimizing memory
garbage collection
- optimizing / Optimizing garbage collection, How to do it…
garbage collector (GC)
- about / Optimizing memory
Gradient Boosted Trees (GBTs)
- about / Doing classification using Gradient Boosted Trees
- classification, performing / Doing classification using Gradient Boosted Trees, How to do it…
gradient descent
- about / Understanding cost function
graphs
- directed graph / Introduction
- regular graph / Introduction
- fundamental operations / Fundamental operations on graphs, How to do it…

H

Hadoop distributed file system (HDFS)
- about / How to do it...
HDFS
- about / Introduction
- data, loading / Loading data from HDFS, How to do it..., There's more…
- data loading, custom InputFormat used / Loading data from HDFS using a custom InputFormat, How to do it...
HiveContext
- about / Creating HiveContext
- features / Creating HiveContext
- creating / Creating HiveContext, Getting ready, How to do it...
hyperspace
- about / Creating vectors
hypothesis function
- about / Getting ready, Understanding cost function
hypothesis testing
- about / Doing hypothesis testing
- performing / Doing hypothesis testing, How to do it…

I

implicit feedback
- used, for collaborative filtering / Collaborative filtering using implicit feedback, Getting ready, How it works…, There's more…
InputFormat storage format
- about / Introduction
IntelliJ IDEA
- Spark application, developing with Maven / Developing a Spark application in IntelliJ IDEA with Maven, How to do it...
- Spark application, developing with SBT / Developing a Spark application in IntelliJ IDEA with SBT, How to do it...

J

JdbcRDD
- about / Loading data from relational databases, How it works…
JSON format
- data, loading / Loading and saving data using the JSON format, How to do it..., How it works…
- data, saving / Loading and saving data using the JSON format, How to do it..., How it works…

K

k-means algorithm
- using / Clustering using k-means, Getting ready, How to do it…
- cluster assignment step / Clustering using k-means
- move centroid step / Clustering using k-means
Kafka
- about / Streaming using Kafka
- using / Streaming using Kafka, How to do it..., There's more…
kilobytes per second (kbps)
- about / Introduction
Kryo library
- about / Using serialization to improve performance

L

labeled point
- about / Creating a labeled point
- creating / Creating a labeled point, How to do it…
lasso
- about / Doing linear regression with lasso
- linear regression, performing / Doing linear regression with lasso, How to do it…
- URL / Doing linear regression with lasso
latent features
- about / Introduction
level of parallelism
- optimizing / Optimizing the level of parallelism
leverage application semantics
- used, for manual memory management / Manual memory management by leverage application semantics
lineage
- about / Introduction
linear regression
- about / Using linear regression, Understanding cost function
- using / Getting ready, How to do it…
- analyzing, for cost function / Understanding cost function
- performing, with lasso / Doing linear regression with lasso, How to do it…
local filesystem
- data, loading / Loading data from the local filesystem, How to do it...
local matrix
- about / Creating matrices
logistic function
- about / Doing classification using logistic regression
logistic regression
- classification, performing / Doing classification using logistic regression, Getting ready, How to do it…
LZO
- about / Using compression to improve performance

M

machine learning
- about / Introduction
machine learning pipelines
- creating, ML library used / Creating machine learning pipelines using ML, Getting ready, How to do it…
manual memory management
- by leverage application semantics / Manual memory management by leverage application semantics
matrices
- about / Creating matrices
- creating / Creating matrices, How to do it…
- local matrix / Creating matrices
- distributed matrix / Creating matrices
Maven
- Spark source code, building / Building the Spark source code with Maven, How to do it...
- Spark application, developing in Eclipse / Developing Spark applications in Eclipse with Maven, How to do it...
- about / Developing Spark applications in Eclipse with Maven
- features / Developing Spark applications in Eclipse with Maven
- Spark application, developing in IntelliJ IDEA / Developing a Spark application in IntelliJ IDEA with Maven, How to do it...
measurement scales
- Nominal Scale / Introduction
- Ordinal Scale / Introduction
- Interval Scale / Introduction
- Ratio Scale / Introduction
megabytes per second (mbps)
- about / Introduction
memory optimization
- about / Optimizing memory
- improvements / Optimizing memory
- aspects / Optimizing memory
Mesos
- about / Introduction, Deploying on a cluster with Mesos
- Spark, deploying / Deploying on a cluster with Mesos, How to do it...
- fine-grained mode / How to do it...
- coarse-grained mode / How to do it...
ML library
- used, for creating machine learning pipelines / Creating machine learning pipelines using ML, Getting ready, How to do it…
MovieLens dataset
- URL / Introduction
multigraph
- about / Introduction
multivariate analysis
- about / Introduction

N

Naïve Bayes
- classification, performing / Doing classification with Naïve Bayes, How to do it…
Naïve Bayes assumption
- about / Doing classification with Naïve Bayes
Naïve Bayes classifier
- about / Doing classification with Naïve Bayes
negative correlation
- about / Calculating correlation
neighborhood aggregation
- performing / Performing neighborhood aggregation, How to do it…
null hypothesis
- about / Doing hypothesis testing

O

old collection
- about / Optimizing memory
ordinary least squares (OLS)
- about / Doing linear regression with lasso
- prediction accuracy / Doing linear regression with lasso
- interpretation / Doing linear regression with lasso
OutputFormat storage format
- about / Introduction
overfitting
- about / How it works…

P

PageRank
- about / Using PageRank
- using / Using PageRank, Getting ready, How to do it…
parallel edges
- about / Introduction
Parquet format
- data, saving / Loading and saving data using the Parquet format, How to do it..., How it works…, There's more…
- data, loading / Loading and saving data using the Parquet format, How to do it..., How it works…, There's more…
partitioned log
- about / Streaming using Kafka
performance improvement
- with compression / Using compression to improve performance
- with serialization / Using serialization to improve performance
plain old Java objects (POJOs)
- about / Inferring schema using case classes
positive correlation
- about / Calculating correlation
principal component analysis (PCA)
- about / Dimensionality reduction with principal component analysis
- using / Dimensionality reduction with principal component analysis, Getting ready, How to do it…
producers
- about / Streaming using Kafka
projection error
- about / Dimensionality reduction with principal component analysis
project Tungsten
- about / Understanding the future of optimization – project Tungsten
- manual memory management / Manual memory management by leverage application semantics
- algorithms, using / Using algorithms and data structures
- data structures, using / Using algorithms and data structures
- code generation / Code generation

Q

Quasi quotes
- about / Code generation

R

Random Forests
- classification, performing / Doing classification using Random Forests, Getting ready, How to do it…, How it works…
RDD
- about / Introduction
- wordcount example / Introduction
recommender systems
- about / Introduction
regression
- about / Introduction
relational databases
- data, loading / Loading data from relational databases, How to do it..., How it works…, Loading and saving data from relational databases, How to do it...
- data, saving / Loading and saving data from relational databases, How to do it...
resilient distributed property graph
- about / Introduction
ridge regression
- about / Doing ridge regression
- performing / Doing ridge regression, How to do it…
Root Mean Square Error (RMSE)
- about / Collaborative filtering using explicit feedback

S

s3://
- about / How to do it...
s3n://
- about / How to do it...
SBT
- about / Developing Spark applications in Eclipse with SBT
- Spark application, developing in Eclipse / Developing Spark applications in Eclipse with SBT, How to do it...
- Spark application, developing in IntelliJ IDEA / Developing a Spark application in IntelliJ IDEA with SBT, How to do it...
sbt-assembly plugin
- merge strategies / Merge strategies in sbt-assembly
schema
- inferring, case classes used / Inferring schema using case classes, How to do it...
- programmatically specifying / Programmatically specifying the schema, How to do it..., How it works…
SchemaRDD
- about / Introduction
secure shell protocol (SSH)
- about / How to do it...
serialization
- used, for performance improvement / Using serialization to improve performance
sigmoid function
- about / Doing classification using logistic regression
singular value decomposition (SVD)
- using / Dimensionality reduction with singular value decomposition, Getting ready, How to do it…
sliding window, parameters
- window length / Introduction
- sliding interval / Introduction
Snappy
- about / Using compression to improve performance
Spark
- about / Introduction
- ecosystem / Introduction
- URL / Installing Spark from binaries
- installing, from binaries / Getting ready, How to do it...
- source code, building with Maven / Building the Spark source code with Maven, How to do it...
- launching, on Amazon EC2 / Launching Spark on Amazon EC2, Getting ready, How to do it...
- deploying, on cluster in standalone mode / Deploying on a cluster in standalone mode, How to do it..., How it works...
- deploying, on cluster with Mesos / Deploying on a cluster with Mesos, How to do it...
- deploying, on cluster with YARN / Deploying on a cluster with YARN, How to do it..., How it works…
spark-ec2 script
- about / Getting ready
Spark 1.3 version
- URL / How to do it...
Spark application
- developing, in Eclipse with Maven / Developing Spark applications in Eclipse with Maven, How to do it...
- developing, in Eclipse with SBT / Developing Spark applications in Eclipse with SBT, How to do it...
- developing, in IntelliJ IDEA with Maven / Developing a Spark application in IntelliJ IDEA with Maven, How to do it...
- developing, in IntelliJ IDEA with SBT / Developing a Spark application in IntelliJ IDEA with SBT, How to do it...
Spark master
- about / How it works...
Spark RDD
- about / Using Tachyon as an off-heap storage layer
- challenges / Using Tachyon as an off-heap storage layer
Spark shell
- exploring / Exploring the Spark shell, How to do it...
Spark SQL
- about / Introduction
squared error function
- about / Understanding cost function
Standalone mode
- about / Introduction
- reference link / See also
standalone mode
- Spark, deploying / Deploying on a cluster in standalone mode, How to do it..., How it works...
start-all.sh script
- about / How to do it...
start-master.sh script
- about / How to do it...
start-slaves.sh script
- about / How to do it...
stop-all.sh script
- about / How to do it...
stop-master.sh script
- about / How to do it...
stop-slaves.sh script
- about / How to do it...
Streaming
- about / Introduction
- used, for word count / Word count using Streaming, How to do it...
- with Kafka / Streaming using Kafka, How to do it..., There's more…
subgraph
- about / Finding connected components
summary statistics
- about / Calculating summary statistics
- calculating / Calculating summary statistics, How to do it…
supervised learning
- about / Introduction, Introduction
- regression / Introduction
- classification / Introduction
- example / Introduction
support vector machines (SVM)
- about / Introduction
support vectors
- about / Doing binary classification using SVM
SVM
- binary classification, performing / Doing binary classification using SVM, How to do it…

T

Tachyon
- about / Introduction
- using, as off-heap storage layer / Using Tachyon as an off-heap storage layer, How to do it...
- reference link / See also
text classification
- about / Doing classification with Naïve Bayes
topics
- about / Streaming using Kafka
training data
- about / Doing binary classification using SVM
Twitter data
- live streaming / Streaming Twitter data, How to do it...

U

unsupervised learning
- about / Introduction
use case, clustering
- market segmentation / Clustering using k-means
- social network analysis / Clustering using k-means
- data center computing clusters / Clustering using k-means
- astronomical data analysis / Clustering using k-means
- real estate / Clustering using k-means
- text analysis / Clustering using k-means

V

variance
- versus bias / Doing linear regression with lasso
- about / Doing linear regression with lasso
vectors
- creating / Creating vectors, How it works...

W

Wikipedia page link data
- URL / Getting ready
word count
- with Streaming / Word count using Streaming, How to do it...
worker
- about / How it works...

Y

YARN
- about / Introduction, Deploying on a cluster with YARN
- Spark, deploying on cluster / Deploying on a cluster with YARN, How to do it..., How it works…
- yarn-client mode / How it works…
- yarn-cluster mode / How it works…
- configuration parameters / How it works…
young collection
- about / Optimizing memory

Z

z density of house
- about / Getting ready
