Spark Cookbook

Product type Book
Published in Jul 2015
ISBN-13 9781783987061
Pages 226 pages
Edition 1st Edition
Author (1):
Rishi Yadav

Table of Contents (19) Chapters

Spark Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
1. Getting Started with Apache Spark
2. Developing Applications with Spark
3. External Data Sources
4. Spark SQL
5. Spark Streaming
6. Getting Started with Machine Learning Using MLlib
7. Supervised Learning with MLlib – Regression
8. Supervised Learning with MLlib – Classification
9. Unsupervised Learning with MLlib
10. Recommender Systems
11. Graph Processing Using GraphX
12. Optimizations and Performance Tuning
Index

Index

A

  • Alternating Least Squares (ALS)
    • about / Collaborative filtering using explicit feedback
  • Amazon EC2
    • about / Launching Spark on Amazon EC2
    • features / Launching Spark on Amazon EC2
    • Spark, launching / Launching Spark on Amazon EC2, Getting ready, How to do it...
    • URL / Getting ready
  • Amazon Elastic Block Storage (EBS)
    • about / Loading data from Amazon S3
  • Amazon Elastic Compute Cloud (EC2)
    • about / Loading data from Amazon S3
  • Amazon S3
    • data, loading / Loading data from Amazon S3, How to do it...
    • about / Loading data from Amazon S3
    • URL / Getting ready
  • Amazon Web Services (AWS)
    • about / Loading data from Amazon S3
    • URL / How to do it...
  • Apache Cassandra
    • about / Loading data from Apache Cassandra
    • data, loading / Loading data from Apache Cassandra, How to do it..., There's more...
  • arbitrary source
    • data, saving / Loading and saving data from an arbitrary source, How to do it...
    • data, loading / Loading and saving data from an arbitrary source, How to do it...

B

  • batch interval
    • about / Introduction
  • bias
    • versus variance / Doing linear regression with lasso
    • about / Doing linear regression with lasso
  • binaries
    • Spark, installing / Getting ready, How to do it...
  • binary classification
    • performing, with SVM / Doing binary classification using SVM, How to do it…
  • bivariate analysis
    • about / Introduction
  • broker
    • about / Streaming using Kafka

C

  • case classes
    • used, for inferring schema / Inferring schema using case classes, How to do it...
  • Catalyst optimizer
    • about / Understanding the Catalyst optimizer
    • goals / How it works…
    • using, in analysis phase / Analysis
    • using, in logical plan optimization phase / Logical plan optimization
    • using, in physical planning phase / Physical planning
    • using, in code generation phase / Code generation
  • classification
    • about / Introduction
    • performing, with logistic regression / Doing classification using logistic regression, Getting ready, How to do it…
    • performing, with decision trees / Doing classification using decision trees, Getting ready, How to do it…, How it works…
    • performing, with Random Forests / Doing classification using Random Forests, Getting ready, How to do it…, How it works…
    • performing, with Gradient Boosted Trees / Doing classification using Gradient Boosted Trees, How to do it…
    • performing, with Naïve Bayes / Doing classification with Naïve Bayes, How to do it…
  • cluster centroids
    • about / Clustering using k-means
  • clustering
    • about / Introduction, Clustering using k-means
    • k-means algorithm, using / Clustering using k-means, Getting ready, How to do it…
  • collaborative filtering
    • about / Collaborative filtering using explicit feedback
    • explicit feedback, using / Collaborative filtering using explicit feedback, Getting ready, How to do it…
    • implicit feedback, using / Collaborative filtering using implicit feedback, Getting ready, How it works…, There's more…
  • comma-separated values (CSV) file
    • about / Getting ready
  • complex event processing (CEP)
    • about / Introduction
  • compression
    • about / Using compression to improve performance
    • used, for performance improvement / Using compression to improve performance
  • concurrent mark and sweep (CMS)
    • about / Optimizing memory
  • connected component
    • searching / Finding connected components, Getting ready, How to do it…
  • Connector/J
    • URL / How to do it...
  • connector library
    • about / There's more...
  • consumers
    • about / Streaming using Kafka
  • correlation
    • about / Calculating correlation
    • calculating / Calculating correlation, Getting ready, How to do it…
    • positive correlation / Calculating correlation
    • negative correlation / Calculating correlation
  • cost function
    • about / Understanding cost function
    • analyzing, for linear regression / Understanding cost function
  • custom InputFormat
    • used, for loading data from HDFS / Loading data from HDFS using a custom InputFormat, How to do it...

D

  • data
    • loading, from local filesystem / Loading data from the local filesystem, How to do it...
    • loading, from HDFS / Loading data from HDFS, How to do it..., There's more…
    • loading from HDFS, custom InputFormat used / Loading data from HDFS using a custom InputFormat, How to do it...
    • loading, from Amazon S3 / Loading data from Amazon S3, How to do it...
    • loading, from Apache Cassandra / Loading data from Apache Cassandra, How to do it..., There's more...
    • loading, from relational databases / Loading data from relational databases, How to do it..., How it works…, Loading and saving data from relational databases, How to do it...
    • loading, in Parquet format / Loading and saving data using the Parquet format, How to do it..., How it works…, There's more…
    • saving, in Parquet format / Loading and saving data using the Parquet format, How to do it..., How it works…, There's more…
    • loading, in JSON format / Loading and saving data using the JSON format, How to do it..., How it works…
    • saving, in JSON format / Loading and saving data using the JSON format, How to do it..., How it works…
    • saving, from relational databases / Loading and saving data from relational databases, How to do it...
    • loading, from arbitrary source / Loading and saving data from an arbitrary source, How to do it...
    • saving, from arbitrary source / Loading and saving data from an arbitrary source, How to do it...
  • DataFrame
    • about / Introduction
  • data rate
    • about / Introduction
  • data source API
    • URL / There's more…
  • decision trees
    • classification, performing / Doing classification using decision trees, Getting ready, How to do it…, How it works…
  • dimensionality reduction
    • about / Dimensionality reduction with principal component analysis
    • purposes / Dimensionality reduction with principal component analysis
    • with principal component analysis (PCA) / Dimensionality reduction with principal component analysis, Getting ready, How to do it…
    • with singular value decomposition (SVD) / Dimensionality reduction with singular value decomposition, Getting ready, How to do it…
  • directed graph
    • about / Introduction
  • directories
    • ephemeral-hdfs / How to do it...
    • persistent-hdfs / How to do it...
    • hadoop-native / How to do it...
    • Scala / How to do it...
    • Shark / How to do it...
    • Spark / How to do it...
    • spark-ec2 / How to do it...
    • Tachyon / How to do it...
  • Discretized Stream (DStream)
    • about / Introduction
  • distributed graph processing
    • data parallel / Introduction
    • graph parallel / Introduction
  • distributed matrix
    • about / Creating matrices
    • RowMatrix / Creating matrices
    • IndexedRowMatrix / Creating matrices
    • CoordinateMatrix / Creating matrices
  • domain-specific language (DSL)
    • about / Introduction

E

  • Eclipse
    • Spark application, developing with Maven / Developing Spark applications in Eclipse with Maven, How to do it...
    • URL / Getting ready
    • Spark application, developing with SBT / Developing Spark applications in Eclipse with SBT, How to do it...
  • Eden
    • about / Optimizing memory
  • ensemble learning algorithms
    • about / Doing classification using Random Forests
  • Estimator
    • about / Getting ready
  • explicit feedback
    • used, for collaborative filtering / Collaborative filtering using explicit feedback, Getting ready, How to do it…

F

  • fat-free XML
    • about / Loading and saving data using the JSON format
  • features, vectors
    • about / Creating vectors
  • feature scaling
    • about / Getting ready
    • performing / Getting ready

G

  • garbage-first GC (G1)
    • about / Optimizing memory
  • garbage collection
    • optimizing / Optimizing garbage collection, How to do it…
  • garbage collector (GC)
    • about / Optimizing memory
  • Gradient Boosted Trees (GBTs)
    • about / Doing classification using Gradient Boosted Trees
    • classification, performing / Doing classification using Gradient Boosted Trees, How to do it…
  • gradient descent
    • about / Understanding cost function
  • graphs
    • directed graph / Introduction
    • regular graph / Introduction
    • fundamental operations / Fundamental operations on graphs, How to do it…

H

  • Hadoop distributed file system (HDFS)
    • about / How to do it...
  • HDFS
    • about / Introduction
    • data, loading / Loading data from HDFS, How to do it..., There's more…
    • data loading, custom InputFormat used / Loading data from HDFS using a custom InputFormat, How to do it...
  • HiveContext
    • about / Creating HiveContext
    • features / Creating HiveContext
    • creating / Creating HiveContext, Getting ready, How to do it...
  • hyperspace
    • about / Creating vectors
  • hypothesis function
    • about / Getting ready, Understanding cost function
  • hypothesis testing
    • about / Doing hypothesis testing
    • performing / Doing hypothesis testing, How to do it…

I

  • implicit feedback
    • used, for collaborative filtering / Collaborative filtering using implicit feedback, Getting ready, How it works…, There's more…
  • InputFormat storage format
    • about / Introduction
  • IntelliJ IDEA
    • Spark application, developing with Maven / Developing a Spark application in IntelliJ IDEA with Maven, How to do it...
    • Spark application, developing with SBT / Developing a Spark application in IntelliJ IDEA with SBT, How to do it...

J

  • JdbcRDD
    • about / Loading data from relational databases, How it works…
  • JSON format
    • data, loading / Loading and saving data using the JSON format, How to do it..., How it works…
    • data, saving / Loading and saving data using the JSON format, How to do it..., How it works…

K

  • k-means algorithm
    • using / Clustering using k-means, Getting ready, How to do it…
    • cluster assignment step / Clustering using k-means
    • move centroid step / Clustering using k-means
  • Kafka
    • about / Streaming using Kafka
    • using / Streaming using Kafka, How to do it..., There's more…
  • kilobytes per second (kbps)
    • about / Introduction
  • Kryo library
    • about / Using serialization to improve performance

L

  • labeled point
    • about / Creating a labeled point
    • creating / Creating a labeled point, How to do it…
  • lasso
    • about / Doing linear regression with lasso
    • linear regression, performing / Doing linear regression with lasso, How to do it…
    • URL / Doing linear regression with lasso
  • latent features
    • about / Introduction
  • level of parallelism
    • optimizing / Optimizing the level of parallelism
  • leverage application semantics
    • used, for manual memory management / Manual memory management by leverage application semantics
  • lineage
    • about / Introduction
  • linear regression
    • about / Using linear regression, Understanding cost function
    • using / Getting ready, How to do it…
    • analyzing, for cost function / Understanding cost function
    • performing, with lasso / Doing linear regression with lasso, How to do it…
  • local filesystem
    • data, loading / Loading data from the local filesystem, How to do it...
  • local matrix
    • about / Creating matrices
  • logistic function
    • about / Doing classification using logistic regression
  • logistic regression
    • classification, performing / Doing classification using logistic regression, Getting ready, How to do it…
  • LZO
    • about / Using compression to improve performance

M

  • machine learning
    • about / Introduction
  • machine learning pipelines
    • creating, ML library used / Creating machine learning pipelines using ML, Getting ready, How to do it…
  • manual memory management
    • by leverage application semantics / Manual memory management by leverage application semantics
  • matrices
    • about / Creating matrices
    • creating / Creating matrices, How to do it…
    • local matrix / Creating matrices
    • distributed matrix / Creating matrices
  • Maven
    • Spark source code, building / Building the Spark source code with Maven, How to do it...
    • Spark application, developing in Eclipse / Developing Spark applications in Eclipse with Maven, How to do it...
    • about / Developing Spark applications in Eclipse with Maven
    • features / Developing Spark applications in Eclipse with Maven
    • Spark application, developing in IntelliJ IDEA / Developing a Spark application in IntelliJ IDEA with Maven, How to do it...
  • measurement scales
    • Nominal Scale / Introduction
    • Ordinal Scale / Introduction
    • Interval Scale / Introduction
    • Ratio Scale / Introduction
  • megabytes per second (mbps)
    • about / Introduction
  • memory optimization
    • about / Optimizing memory
    • improvements / Optimizing memory
    • aspects / Optimizing memory
  • Mesos
    • about / Introduction, Deploying on a cluster with Mesos
    • Spark, deploying / Deploying on a cluster with Mesos, How to do it...
    • fine-grained mode / How to do it...
    • coarse-grained mode / How to do it...
  • ML library
    • used, for creating machine learning pipelines / Creating machine learning pipelines using ML, Getting ready, How to do it…
  • MovieLens dataset
    • URL / Introduction
  • multigraph
    • about / Introduction
  • multivariate analysis
    • about / Introduction

N

  • Naïve Bayes
    • classification, performing / Doing classification with Naïve Bayes, How to do it…
  • Naïve Bayes assumption
    • about / Doing classification with Naïve Bayes
  • Naïve Bayes classifier
    • about / Doing classification with Naïve Bayes
  • negative correlation
    • about / Calculating correlation
  • neighborhood aggregation
    • performing / Performing neighborhood aggregation, How to do it…
  • null hypothesis
    • about / Doing hypothesis testing

O

  • old collection
    • about / Optimizing memory
  • ordinary least squares (OLS)
    • about / Doing linear regression with lasso
    • prediction accuracy / Doing linear regression with lasso
    • interpretation / Doing linear regression with lasso
  • OutputFormat storage format
    • about / Introduction
  • overfitting
    • about / How it works…

P

  • PageRank
    • about / Using PageRank
    • using / Using PageRank, Getting ready, How to do it…
  • parallel edges
    • about / Introduction
  • Parquet format
    • data, saving / Loading and saving data using the Parquet format, How to do it..., How it works…, There's more…
    • data, loading / Loading and saving data using the Parquet format, How to do it..., How it works…, There's more…
  • partitioned log
    • about / Streaming using Kafka
  • performance improvement
    • with compression / Using compression to improve performance
    • with serialization / Using serialization to improve performance
  • plain old Java objects (POJOs)
    • about / Inferring schema using case classes
  • positive correlation
    • about / Calculating correlation
  • principal component analysis (PCA)
    • about / Dimensionality reduction with principal component analysis
    • using / Dimensionality reduction with principal component analysis, Getting ready, How to do it…
  • producers
    • about / Streaming using Kafka
  • projection error
    • about / Dimensionality reduction with principal component analysis
  • project Tungsten
    • about / Understanding the future of optimization – project Tungsten
    • manual memory management / Manual memory management by leverage application semantics
    • algorithms, using / Using algorithms and data structures
    • data structures, using / Using algorithms and data structures
    • code generation / Code generation

Q

  • Quasi quotes
    • about / Code generation

R

  • Random Forests
    • classification, performing / Doing classification using Random Forests, Getting ready, How to do it…, How it works…
  • RDD
    • about / Introduction
    • wordcount example / Introduction
  • recommender systems
    • about / Introduction
  • regression
    • about / Introduction
  • relational databases
    • data, loading / Loading data from relational databases, How to do it..., How it works…, Loading and saving data from relational databases, How to do it...
    • data, saving / Loading and saving data from relational databases, How to do it...
  • resilient distributed property graph
    • about / Introduction
  • ridge regression
    • about / Doing ridge regression
    • performing / Doing ridge regression, How to do it…
  • Root Mean Square Error (RMSE)
    • about / Collaborative filtering using explicit feedback

S

  • s3://
    • about / How to do it...
  • s3n://
    • about / How to do it...
  • SBT
    • about / Developing Spark applications in Eclipse with SBT
    • Spark application, developing in Eclipse / Developing Spark applications in Eclipse with SBT, How to do it...
    • Spark application, developing in IntelliJ IDEA / Developing a Spark application in IntelliJ IDEA with SBT, How to do it...
  • sbt-assembly plugin
    • merge strategies / Merge strategies in sbt-assembly
  • schema
    • inferring, case classes used / Inferring schema using case classes, How to do it...
    • programmatically specifying / Programmatically specifying the schema, How to do it..., How it works…
  • SchemaRDD
    • about / Introduction
  • secure shell protocol (SSH)
    • about / How to do it...
  • serialization
    • used, for performance improvement / Using serialization to improve performance
  • sigmoid function
    • about / Doing classification using logistic regression
  • singular value decomposition (SVD)
    • using / Dimensionality reduction with singular value decomposition, Getting ready, How to do it…
  • sliding window, parameters
    • window length / Introduction
    • sliding interval / Introduction
  • Snappy
    • about / Using compression to improve performance
  • Spark
    • about / Introduction
    • ecosystem / Introduction
    • URL / Installing Spark from binaries
    • installing, from binaries / Getting ready, How to do it...
    • source code, building with Maven / Building the Spark source code with Maven, How to do it...
    • launching, on Amazon EC2 / Launching Spark on Amazon EC2, Getting ready, How to do it...
    • deploying, on cluster in standalone mode / Deploying on a cluster in standalone mode, How to do it..., How it works...
    • deploying, on cluster with Mesos / Deploying on a cluster with Mesos, How to do it...
    • deploying, on cluster with YARN / Deploying on a cluster with YARN, How to do it..., How it works…
  • spark-ec2 script
    • about / Getting ready
  • Spark 1.3 version
    • URL / How to do it...
  • Spark application
    • developing, in Eclipse with Maven / Developing Spark applications in Eclipse with Maven, How to do it...
    • developing, in Eclipse with SBT / Developing Spark applications in Eclipse with SBT, How to do it...
    • developing, in IntelliJ IDEA with Maven / Developing a Spark application in IntelliJ IDEA with Maven, How to do it...
    • developing, in IntelliJ IDEA with SBT / Developing a Spark application in IntelliJ IDEA with SBT, How to do it...
  • Spark master
    • about / How it works...
  • Spark RDD
    • about / Using Tachyon as an off-heap storage layer
    • challenges / Using Tachyon as an off-heap storage layer
  • Spark shell
    • exploring / Exploring the Spark shell, How to do it...
  • Spark SQL
    • about / Introduction
  • squared error function
    • about / Understanding cost function
  • Standalone mode
    • about / Introduction
    • reference link / See also
  • standalone mode
    • Spark, deploying / Deploying on a cluster in standalone mode, How to do it..., How it works...
  • start-all.sh script
    • about / How to do it...
  • start-master.sh script
    • about / How to do it...
  • start-slaves.sh script
    • about / How to do it...
  • stop-all.sh script
    • about / How to do it...
  • stop-master.sh script
    • about / How to do it...
  • stop-slaves.sh script
    • about / How to do it...
  • Streaming
    • about / Introduction
    • used, for word count / Word count using Streaming, How to do it...
    • with Kafka / Streaming using Kafka, How to do it..., There's more…
  • subgraph
    • about / Finding connected components
  • summary statistics
    • about / Calculating summary statistics
    • calculating / Calculating summary statistics, How to do it…
  • supervised learning
    • about / Introduction, Introduction
    • regression / Introduction
    • classification / Introduction
    • example / Introduction
  • support vector machines (SVM)
    • about / Introduction
  • support vectors
    • about / Doing binary classification using SVM
  • SVM
    • binary classification, performing / Doing binary classification using SVM, How to do it…

T

  • Tachyon
    • about / Introduction
    • using, as off-heap storage layer / Using Tachyon as an off-heap storage layer, How to do it...
    • reference link / See also
  • text classification
    • about / Doing classification with Naïve Bayes
  • topics
    • about / Streaming using Kafka
  • training data
    • about / Doing binary classification using SVM
  • Twitter data
    • live streaming / Streaming Twitter data, How to do it...

U

  • unsupervised learning
    • about / Introduction
  • use case, clustering
    • market segmentation / Clustering using k-means
    • social network analysis / Clustering using k-means
    • data center computing clusters / Clustering using k-means
    • astronomical data analysis / Clustering using k-means
    • real estate / Clustering using k-means
    • text analysis / Clustering using k-means

V

  • variance
    • versus bias / Doing linear regression with lasso
    • about / Doing linear regression with lasso
  • vectors
    • creating / Creating vectors, How it works...

W

  • Wikipedia page link data
    • URL / Getting ready
  • word count
    • with Streaming / Word count using Streaming, How to do it...
  • worker
    • about / How it works...

Y

  • YARN
    • about / Introduction, Deploying on a cluster with YARN
    • Spark, deploying on cluster / Deploying on a cluster with YARN, How to do it..., How it works…
    • yarn-client mode / How it works…
    • yarn-cluster mode / How it works…
    • configuration parameters / How it works…
  • young collection
    • about / Optimizing memory

Z

  • z density of house
    • about / Getting ready