Apache Spark 2: Data Processing and Real-Time Analytics

Master complex big data processing, stream analytics, and machine learning with Apache Spark

Product type Course

Published in Dec 2018

Publisher Packt

ISBN-13 9781789959208

Length 616 pages

Edition 1st Edition

Languages

Processing

Tools

Apache Spark

Concepts

Big Data

Authors (7):

Sridhar Alla

Romeo Kienzler

Siamak Amirghodsi

Broderick Hall

Md. Rezaul Karim

Meenakshi Rajendran

Shuen Mei

Table of Contents (23) Chapters

Title Page

About Packt

Contributors

Preface

1. A First Taste and What's New in Apache Spark V2

2. Apache Spark Streaming

3. Structured Streaming

4. Apache Spark MLlib

5. Apache SparkML

6. Apache SystemML

7. Apache Spark GraphX

8. Spark Tuning

9. Testing and Debugging Spark

10. Practical Machine Learning with Spark Using Scala

11. Spark's Three Data Musketeers for Machine Learning - Perfect Together

12. Common Recipes for Implementing a Robust Machine Learning System

13. Recommendation Engine that Scales with Spark

14. Unsupervised Clustering with Apache Spark 2.0

15. Implementing Text Analytics with Spark 2.0 ML Library

16. Spark Streaming and Machine Learning Library

Other Books You May Enjoy

Leave a review - let other readers know what you think

Index

A

abstract syntax tree (AST) / High-level operators are generated
ACM Digital Library
- reference / There's more...
Alluxio / Hadoop Distributed File System
alternating least squares (ALS) / Introduction, An example - alternating least squares
Apache Giraph
- reference / Overview
Apache Mesos / Apache Mesos
Apache Spark / Introduction, Apache Spark
- cluster design / Cluster design
- windowing, improving / How Apache Spark improves windowing
Apache Spark 2.0
- used, for running first program with IntelliJ IDE / Running your first program using Apache Spark 2.0 with the IntelliJ IDE
Apache Spark GraphX
- overview / Overview
Apache Spark GraphX module / Spark graph processing
Apache SparkML pipelines
- components / The concept of pipelines
Apache Spark V2
- changes / What's new in Apache Spark V2?
Apache Spark V2.2
- unsupported operations / Increased performance with good old friends
Apache Streaming
- data sources / Overview
Apache SystemML
- about / Spark machine learning, Why do we need just another library?
- history / The history of Apache SystemML
- performance measurements / Performance measurements
- working / Apache SystemML in action
Apache SystemML architecture
- about / ApacheSystemML architecture
- language parsing / Language parsing
- high-level operators, generating / High-level operators are generated
- low-level operators, optimizing / How low-level operators are optimized on
Apache YARN / Apache YARN
artificial neural networks (ANNs)
- about / Artificial neural networks
- working / ANN in practice

B

basic statistical API, Spark
- used, for building algorithms / Spark's basic statistical API to help you build your own algorithms
batch time / Overview
binary classification model
- evaluating, Spark 2.0 used / Binary classification model evaluation using Spark 2.0
Breeze graphics
- reference / There's more...

C

C++ / Scala
Cassandra / Hadoop Distributed File System
Catalyst / There's more...
Ceph / Hadoop Distributed File System
checkpointing / Checkpointing
classification
- with Naive Bayes / Classification with Naive Bayes
classifiers / How it works...
Cloud / Cloud
cloud-based deployments / Cloud-based deployments
Cloud9 library
- reference / How to do it...
clustering
- with K-means / Clustering with K-Means
clustering systems / Introduction
cluster management / Cluster management
cluster manager options, Apache Spark
- local / Local
- standalone / Standalone
cluster structure / The cluster structure
coding / Coding
collaborative filtering
- about / Collaborative filtering
- used, for building scalable recommendation engine / Building a scalable recommendation engine using collaborative filtering in Spark 2.0, There's more...
common mistakes, Spark app development
- about / Common mistakes in Spark app development
- application failure / Application failure
- slow jobs / Slow jobs or unresponsiveness
- unresponsiveness / Slow jobs or unresponsiveness
components, Apache SparkML pipelines
- DataFrame / The concept of pipelines
- transformer / The concept of pipelines
- estimator / The concept of pipelines, Estimators
- pipeline / The concept of pipelines, Transformers, Pipelines
- parameter / The concept of pipelines
confusion matrix
- reference / There's more...
content filtering / Content filtering
continuous applications
- about / The concept of continuous applications
- unification / True unification - same code, same engine
- controlling / Controlling continuous applications
continuous bag of words (CBOW) / How it works...
cost-based optimizer, for machine learning algorithms
- about / A cost-based optimizer for machine learning algorithms
- alternating least squares, example / An example - alternating least squares
count-based windows / How streaming engines use windowing
CrossValidation / CrossValidation

D

data
- normalizing, with Spark / Normalizing data with Spark
- streaming / Streaming data and debugging with queueStream
Databricks
- reference / Dataset - a high-level unifying Data API
data classification
- Gaussian Mixture, using / Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
- expectation maximization (EM), using / Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
DataFrame / The concept of pipelines
DataFrame-based machine learning API / Spark machine learning
DataFrameReader / There's more...
DataFrames
- about / Introduction, DataFrame - a natural evolution to unite API and SQL via a high-level API
- creating, from Scala data structures / Creating DataFrames from Scala data structures
- reference / See also
- programmatic operation, without SQL / Operating on DataFrames programmatically without SQL
- loading, from external source / Loading DataFrames and setup from an external source
- documentation reference / See also, See also
- using, with SparkSQL / Using DataFrames with standard SQL language - SparkSQL
- streaming, for real-time machine learning / Streaming DataFrames for real-time machine learning
DataFrameWriter
- reference / There's more...
data locality / Data locality
Data Mining Group (DMG) / There's more...
data serialization
- about / Data serialization
- Java serialization / Data serialization
- Kryo serialization / Data serialization
dataset
- about / Dataset - a high-level unifying Data API
- strong type safety / Dataset - a high-level unifying Data API
- Tungsten Memory Management, enabling / Dataset - a high-level unifying Data API
- encoders / Dataset - a high-level unifying Data API
- catalyst optimizer friendly / Dataset - a high-level unifying Data API
- creating, from RDDs / Creating and using Datasets from RDDs and back again
- using, from RDDs / Creating and using Datasets from RDDs and back again
- streaming, for real-time machine learning / Streaming Datasets for real-time machine learning
Dataset API
- working with, Scala Sequence used / Working with the Dataset API using a Scala Sequence
- used, for performing operations / Common operations with the new Dataset API
- reference / See also
Dataset API and SQL
- used, for working with JSON / Working with JSON using the Dataset API and SQL together
data sources, for practical machine learning
- identifying / Identifying data sources for practical machine learning
data splitting
- for training / Splitting data for training and testing
- for testing / Splitting data for training and testing
DataStreamReader
- reference / See also
DataStreamWriter
- reference / See also
datatypes
- documentation, reference / There's more...
- reference / There's more...
debugging, Spark applications
- Spark Standalone / Debugging Spark applications using logs
- YARN / Debugging Spark applications using logs
- log4j, logging with / Logging with log4j with Spark, Logging with log4j with Spark recap
- about / Debugging Spark applications, Debugging the Spark application
- on Eclipse, as Scala debug / Debugging Spark application on Eclipse as Scala debug
- Spark jobs running as local and standalone mode, debugging / Debugging Spark jobs running as local and standalone mode
- on YARN or Mesos cluster / Debugging Spark applications on YARN or Mesos cluster
- SBT, using / Debugging Spark application using SBT
Deeplearning4j
- about / Spark machine learning
DenseVector API
- reference / See also
dimensionality reduction systems / Introduction
directed acyclic graph (DAG) / Why on Apache Spark?, Jobs
Dirichlet
- reference / Topic modeling with Latent Dirichlet allocation in Spark 2.0
Discretized Stream (DStream) / Introduction
distributed environment, Spark
- testing in / Testing in a distributed environment
- about / Distributed environment
- issues / Issues in a distributed system
- software testing challenges / Challenges of software testing in a distributed environment
domain objects
- used, for functional programming with Dataset API / Functional programming with the Dataset API using domain objects
DSL (domain specific language) / An example - alternating least squares
dynamic rewrites / High-level operators are generated

E

edges / Spark graph processing
errors / Errors and recovery
estimators
- about / The concept of pipelines
- RandomForestClassifier / RandomForestClassifier
ETL (Extract Transform Load) / Naive Bayes in practice
event time
- versus processing time / How Apache Spark improves windowing
exactly-once delivery guarantee
- achieving / How transparent fault tolerance and exactly-once delivery guarantee is achieved, State versioning guarantees consistent results after reruns
expectation maximization (EM)
- used, for data classification / Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
extended ecosystem / Extended ecosystem
external data sources
- used, for creating RDDs / Creating RDDs with Spark 2.0 using external data sources
Extract Transform Load (ETL) / Spark machine learning

F

Facebook / Overview
features / VectorAssembler
FIFO (first-in first-out) scheduling / Standalone
file streams / File streams
filter() API
- used, for transforming RDDs / Transforming RDDs with Spark 2.0 using the filter() API
- reference / See also
firing mechanism (F(Net)) / Artificial neural networks
first-in first-out (FIFO) / Example - connection to a MQTT message broker
flatMap() API
- used, for transforming RDDs / Transforming RDDs with the super useful flatMap() API
Flume
- about / Flume
- reference / Flume
- working / Flume
functional programming, with Dataset API
- domain objects, using / Functional programming with the Dataset API using domain objects

G

garbage collection (GC) / What's new in Apache Spark V2?
Gaussian Mixture
- used, for data classification / Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
- reference / See also
GaussianMixtureModel
- reference / See also
General Electric (GE) / How it works...
GlusterFS / Hadoop Distributed File System
GPFS (General Parallel File System) / Hadoop Distributed File System
graph / Spark graph processing, Overview
graph analytics/processing, with GraphX
- about / Graph analytics/processing with GraphX
- raw data / The raw data
- graph, creating / Creating a graph
- counting example / Example 1 – counting
- filtering example / Example 2 – filtering
- PageRank example / Example 3 – PageRank
- triangle counting example / Example 4 – triangle counting
- connected components example / Example 5 – connected components
graphics
- adding, to Spark program / How to add graphics to your Spark program
graph processing / Spark graph processing
groupBy() method
- used, for RDD transformation/aggregation / RDD transformation/aggregation with groupBy() and reduceByKey()
- reference / There's more...
Guava
- about / There's more...
- reference / There's more...

H

Hadoop / The development environment, Apache Spark
Hadoop Distributed File System / Hadoop Distributed File System
Hadoop Distributed File System (HDFS)
- reference / The development environment
Hadoop runtime
- configuring, on Windows / Configuring Hadoop runtime on Windows
HeartbeatReceiver RPC endpoint / Executors
hierarchical clustering approaches
- divisive / How it works...
- agglomerative / How it works...
- reference / See also
high-level operators (HOPs) / High-level operators are generated
hyperparameters / The concept of pipelines
hyperparameter tuning / Hyperparameter tuning

I

IDEA documentation
- reference / Debugging Spark application using SBT
IEEE Digital Library
- reference / There's more...
implicit input, for training
- dealing with / Dealing with implicit input for training
Infrastructure as a Service (IaaS) / Cloud-based deployments
inputs / ANN in practice
IntelliJ
- configuration, for working with Spark / Configuring IntelliJ to work with Spark and run Spark ML sample codes
IntelliJ IDE
- Apache Spark 2.0, used for running first program / Running your first program using Apache Spark 2.0 with the IntelliJ IDE
internal data sources
- used, for creating RDDs / Creating RDDs with Spark 2.0 using internal data sources
Internet of Things (IoT) / Example - connection to a MQTT message broker
Iris data
- downloading, for unsupervised classification / Downloading and understanding the famous Iris data for unsupervised classification

J

Java / Scala
JavaScript Object Notation (JSON)
- working with, Dataset API and SQL used / Working with JSON using the Dataset API and SQL together
- about / How it works...
Java Virtual Machine (JVM) / What's new in Apache Spark V2?, Hadoop Distributed File System
JBOD (just a bunch of disks) approach / Cluster design
JFreeChart
- reference / See also
JFreeChart JAR files
- reference / How to do it...
JMLR
- reference / There's more...

K

K-Means
- working / K-Means in practice
k-means streaming
- for real-time on-line classifier / Streaming KMeans for a real-time on-line classifier
Kafka
- about / Kafka
- reference / Kafka
- using / Kafka
Kaggle competition, winning with Apache SparkML
- about / Winning a Kaggle competition with Apache SparkML
- data preparation / Data preparation
- feature engineering / Feature engineering
- feature engineering pipeline, testing / Testing the feature engineering pipeline
- machine learning model, training / Training the machine learning model
- model evaluation / Model evaluation
- CrossValidation / CrossValidation and hyperparameter tuning
- hyperparameter tuning / CrossValidation and hyperparameter tuning
- evaluator, used for assessing quality of cross-validated model / Using the evaluator to assess the quality of the cross-validated and tuned model
Kaggle competitions
- reference / There's more...
KMeans
- bisecting / Bisecting KMeans, the new kid on the block in Spark 2.0
- bisecting, reference / There's more...
- streaming, for data classification / Streaming KMeans to classify data in near real-time
- streaming, reference / See also
KMeans() object
- reference / See also
KMeans classifying system
- building, in Spark 2.0 / Building a KMeans classifying system in Spark 2.0, How it works...
- KMeans (Lloyd Algorithm) / KMeans (Lloyd Algorithm)
- K-Means++ (Arthur's Algorithm) / KMeans++ (Arthur's algorithm)
- K-Means|| (pronounced as K-Means Parallel) / KMeans|| (pronounced as KMeans Parallel)
KMeansModel() object
- reference / See also
Kolmogorov-Smirnov (KS) / There's more...

L

LabeledPoint data structure
- for Spark ML / LabeledPoint data structure for Spark ML
- reference / See also
last-in first-out (LIFO) / Example - connection to a MQTT message broker
late data / How Apache Spark improves windowing
Latent Dirichlet Allocation (LDA)
- used, for classifying documents and text into topics / Latent Dirichlet Allocation (LDA) to classify documents and text into topics, See also
- about / Introduction
latent factor models techniques
- Singular Value Decomposition (SVD) / Latent factor models techniques
- Stochastic Gradient Descent (SGD) / Latent factor models techniques
- Alternating Least Squares (ALS) / Latent factor models techniques
latent factors / Building a scalable recommendation engine using collaborative filtering in Spark 2.0
Latent Semantic Analysis (LSA)
- used, for text analytics with Spark 2.0 / Using Latent Semantic Analysis for text analytics with Spark 2.0
LDAModel
- reference / See also
libraries / Software versions and libraries used in this book
linear regression
- streaming, for real-time regression / Streaming linear regression for a real-time regression
Lustre / Hadoop Distributed File System

M

machine learning
- about / Introduction, Machine learning
- data sources / See also
machine learning library (MLlib) / Apache Spark
MapR file system / Hadoop Distributed File System
Maven-based build
- reference / Method 3: Making life easier with Spark testing base
Maven repository
- Spark installation, reference / See also
Mean Squared Error (MSE) / Regression model evaluation using Spark 2.0
memory tuning
- about / Memory tuning
- memory usage / Memory usage and management
- memory management / Memory usage and management
- data structures, tuning / Tuning the data structures
- serialized RDD storage / Serialized RDD storage
- garbage collection tuning / Garbage collection tuning
- level of parallelism / Level of parallelism
- broadcasting / Broadcasting
- data locality / Data locality
metrics
- reference / There's more...
MinMaxScaler
- reference / See also
ML pipelines
- creating, for real-life machine learning applications / ML pipelines for real-life machine learning applications
model evaluation / Model evaluation
model export facility
- exploring / New model export and PMML markup in Spark 2.0
MovieLens dataset
- reference / How it works...
MQTT / Spark Streaming
MQTT (Message Queue Telemetry Transport) / Example - connection to a MQTT message broker
MQTT message broker connection
- example / Example - connection to a MQTT message broker
multiclass classification metrics
- reference / See also
multiclass classification model
- evaluating, Spark 2.0 used / Multiclass classification model evaluation using Spark 2.0
multilabel classification model
- evaluating, Spark 2.0 used / Multilabel classification model evaluation using Spark 2.0
multilabel metrics
- reference / There's more...
multivariate statistical summary
- reference / See also, See also

N

Naive Bayes
- using / Theory on Classification
- working / Naive Bayes in practice
netcat
- reference / TCP stream
Net function / Artificial neural networks
Neural Net (NN) / How it works...
New GaussianMixture() parameter / New GaussianMixture()
nodes / Spark graph processing

O

OneHotEncoder / OneHotEncoder
OOM (Out of Memory) messages
- avoiding / Memory
optimization techniques
- about / Optimization techniques
- data serialization / Data serialization
- memory tuning / Memory tuning
Out Of Memory (OOM) / Common mistakes in Spark app development

P

paired key-value RDDs
- used, for join transformation / Join transformation with paired key-value RDDs
- used, for reducing transformation / Reduce and grouping transformation with paired key-value RDDs
- used, for grouping transformation / Reduce and grouping transformation with paired key-value RDDs
parameters / The concept of pipelines
partitions / RDDs - what started it all...
pattern matching / There's more...
performance / Performance
performance-related problems, Spark
- reference / Cloud
PIC (Power Iteration Clustering) / How it works...
Pima Diabetes data
- downloading, for supervised classification / Downloading Pima Diabetes data for supervised classification
pipelines / What does the new API look like?, The concept of pipelines, Pipelines
Platform as a Service (PaaS) / Cloud-based deployments
PMMLExportable API
- reference / See also
PowerIterationClustering() constructor
- reference / See also
Power Iteration Clustering (PIC)
- used, for classifying graph vertices / Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
PowerIterationClusteringModel() constructor
- reference / See also
practical machine learning, with Spark
- Scala, using / There's more...
Predictive Model Markup Language (PMML)
- using / New model export and PMML markup in Spark 2.0
Priority Queue / Example - connection to a MQTT message broker
processing time
- versus event time / How Apache Spark improves windowing

Q

Quantcast / Hadoop Distributed File System
quasiquotes / There's more...
queueStream
- used, for debugging / Streaming data and debugging with queueStream
- reference / How it works...

R

R
- reference / How it works...
RandomForestClassifier / RandomForestClassifier
randomSplit()
- reference / See also
RankingMetrics API
- documentation link / There's more...
RDDs
- about / RDDs - what started it all...
- JdbcRDD / RDDs - what started it all...
- Vertex RDD / RDDs - what started it all...
- HadoopRDD / RDDs - what started it all...
- UnionRDD / RDDs - what started it all...
- RandomRDD / RDDs - what started it all...
- creating, internal data sources used / Creating RDDs with Spark 2.0 using internal data sources
- creating, external data sources used / Creating RDDs with Spark 2.0 using external data sources
- transforming, filter() API used / Transforming RDDs with Spark 2.0 using the filter() API
- transforming, flatMap() API / Transforming RDDs with the super useful flatMap() API
- transforming, with set operation APIs / Transforming RDDs with set operation APIs
- transforming, with zip() API / Transforming RDDs with the zip() API
- datasets, creating / Creating and using Datasets from RDDs and back again
- datasets, using / Creating and using Datasets from RDDs and back again
- versus DataFrame / Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
- versus Dataset from text file / Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
- documentation, reference / See also
real-life machine learning applications
- ML pipelines, creating / ML pipelines for real-life machine learning applications
real-life Spark ML project
- dump of Wikipedia, downloading / Downloading a complete dump of Wikipedia for a real-life Spark ML project
real-time machine learning
- structured streaming / Structured streaming for near real-time machine learning
- DataFrames, streaming / Streaming DataFrames for real-time machine learning
- Datasets, streaming / Streaming Datasets for real-time machine learning
real-time on-line classifier
- k-means streaming / Streaming KMeans for a real-time on-line classifier
real-time regression
- linear regression, streaming / Streaming linear regression for a real-time regression
recommendation engines / Introduction
recommendation system
- about / Introduction
- movie data details, exploring / Exploring the movies data details for the recommendation system in Spark 2.0
- ratings data details, exploring / Exploring the ratings data details for the recommendation system in Spark 2.0, There's more...
recovery / Errors and recovery
reduceByKey() method
- used, for RDD transformation/aggregation / RDD transformation/aggregation with groupBy() and reduceByKey()
RegressionMetrics facility / Regression model evaluation using Spark 2.0
relationships / Spark graph processing
Resilient Distributed Datasets (RDDs) / Machine learning

S

sample ML code
- running, from Spark / Running a sample ML code from Spark
sbt tool / The development environment
Scala / The development environment, Introduction, Scala
scalable recommendation engine
- required data, setting up / Setting up the required data for a scalable recommendation engine in Spark 2.0
- building, with collaborative filtering / Building a scalable recommendation engine using collaborative filtering in Spark 2.0, There's more...
Scala Breeze library
- used, for creating graphics in Spark 2.0 / Using the Scala Breeze library to do graphics in Spark 2.0
Scala data structures
- DataFrames, creating / Creating DataFrames from Scala data structures
Scala pattern matching
- reference / There's more...
Scala quasiquotes
- reference / There's more...
Scala Sequence
- used, for working with Dataset API / Working with the Dataset API using a Scala Sequence
ScalaTest's assertions
- reference / Testing Scala methods
Scala test guideline
- reference / Testing Scala methods
scikit-learn
- reference / There's more...
session-based windows / How streaming engines use windowing
set operation APIs
- used, for transforming RDDs / Transforming RDDs with set operation APIs
Singular Value Decomposition (SVD)
- reference / See also
skip-gram model with negative sampling (SGNS) / There's more...
sliding windows / How streaming engines use windowing
Software as a Service (SaaS) / Cloud-based deployments
software versions / Software versions and libraries used in this book
Spark
- testing, in distributed environment / Testing in a distributed environment
- reference / Software versions and libraries used in this book
- sample ML code, running / Running a sample ML code from Spark
- download link / There's more...
- used, for normalizing data / Normalizing data with Spark
- tools / Introduction
- term frequency, doing / Doing term frequency with Spark - everything that counts
- used, for displaying similar words / Displaying similar words with Spark using Word2Vec
Spark 1.6 streaming
- reference / There's more...
Spark 2.0
- access to SparkContext vis-a-vis SparkSession object, obtaining / Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0
- regression model, evaluating / Regression model evaluation using Spark 2.0
- used, for multiclass classification model evaluation / Multiclass classification model evaluation using Spark 2.0
- used, for multilabel classification model evaluation / Multilabel classification model evaluation using Spark 2.0
- Scala Breeze library, used for creating graphics / Using the Scala Breeze library to do graphics in Spark 2.0
- KMeans classifying system, building / Building a KMeans classifying system in Spark 2.0
- KMeans, bisecting / Bisecting KMeans, the new kid on the block in Spark 2.0
- Latent Semantic Analysis, used for text analysis / Using Latent Semantic Analysis for text analytics with Spark 2.0
- topic modeling, with Latent Dirichlet allocation / Topic modeling with Latent Dirichlet allocation in Spark 2.0
Spark 2.0 ML documentation
- reference / See also
Spark 2.0 MLlib
- documentation link / See also
Spark 2.0+
- Spark cluster, accessing / Getting access to Spark cluster in Spark 2.0
Spark applications
- visualizing, web UI used / Visualizing Spark application using web UI
- running, observing / Observing the running and completed Spark jobs
- completed Spark jobs, observing / Observing the running and completed Spark jobs
- debugging, logs used / Debugging Spark applications using logs
- testing / Testing Spark applications, Testing Spark applications
- Scala methods, testing / Testing Scala methods
- unit testing / Unit testing
- testing, with Scala JUnit test / Method 1: Using Scala JUnit test
- Scala code, testing with FunSuite / Method 2: Testing Scala code using FunSuite
- Spark testing base / Method 3: Making life easier with Spark testing base
- debugging / Debugging Spark applications
Spark cluster
- accessing, in Spark 2.0+ / Getting access to Spark cluster in Spark 2.0
Spark cluster pre-Spark 2.0
- access, obtaining / Getting access to Spark cluster pre-Spark 2.0
Spark configuration
- about / Spark configuration
- Spark properties / Spark properties
- environment variables / Environmental variables
- logging / Logging
SparkContext
- documentation reference / See also
- reference / See also
SparkContext vis-a-vis SparkSession object
- access, obtaining / Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0
Spark graph processing / Spark graph processing
Spark jobs, monitoring
- about / Monitoring Spark jobs
- Spark web interface / Spark web interface
Spark machine learning / Spark machine learning
SparkML / Spark machine learning
Spark ML
- LabeledPoint data structure / LabeledPoint data structure for Spark ML
SparkML API / What does the new API look like?
Spark MLlib
- architecture / Architecture
- development environment / The development environment
Spark ML sample codes
- running / Configuring IntelliJ to work with Spark and run Spark ML sample codes
Spark program
- graphics, adding / How to add graphics to your Spark program
SparkSession
- reference / See also
SparkSQL
- DataFrames, using / How to do it...
Spark SQL / Spark SQL
Spark Stream Context (SSC) / Overview
Spark streaming
- about / Introduction
- reference / There's more...
Spark Streaming / Spark Streaming
Spark testing base
- reference / Method 3: Making life easier with Spark testing base
Spark web interface
- about / Spark web interface
- Jobs / Jobs
- Stages / Stages
- Storage / Storage
- Environment / Environment
- Executors / Executors
- SQL / SQL
sparse matrix / An example - alternating least squares
SparseVector API
- reference / See also
sparse vector representations / Feature engineering
specialized datasets
- reference / See also
static rewrites / High-level operators are generated
stemming
- reference / How it works...
streaming engines
- windowing, using / How streaming engines use windowing
streaming regression
- wine quality data, downloading / Downloading wine quality data for streaming regression
Streaming sources
- about / Streaming sources
- TCP stream / TCP stream
- file streams / File streams
- Flume / Flume
- Kafka / Kafka
stream life cycle management / More on stream life cycle management
stream processing / Spark Streaming
string indexer / String indexer
structured streaming
- for near real-time machine learning / Structured streaming for near real-time machine learning
- reference / See also
supervised classification
- Pima Diabetes data, downloading / Downloading Pima Diabetes data for supervised classification
Support Vector Machine (SVM)
- about / Normalizing data with Spark

T

Tachyon / Hadoop Distributed File System
TCP stream / TCP stream
TDD (test-driven development) / Testing Scala methods
text analysis / Introduction
time-based windows / How streaming engines use windowing
TinkerPop
- reference / Overview
topic modeling
- with Latent Dirichlet allocation / Topic modeling with Latent Dirichlet allocation in Spark 2.0
transformers
- about / The concept of pipelines, Transformers
- string indexer / String indexer
- OneHotEncoder / OneHotEncoder
- VectorAssembler / VectorAssembler
transparent fault tolerance
- achieving / How transparent fault tolerance and exactly-once delivery guarantee is achieved, Idempotent sinks prevent data duplication
tumbling windows / How streaming engines use windowing

U

UDF (user-defined function) / Memory usage and management
unit testing
- Spark applications / Unit testing
unit vectors
- reference / There's more...
unsupervised classification
- Iris data, downloading / Downloading and understanding the famous Iris data for unsupervised classification
unsupervised learning / Introduction

V

VectorAssembler / VectorAssembler
vertices, graph
- classification, with Power Iteration Clustering (PIC) / Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
Virtual Machine (VM) / Challenges of software testing in a distributed environment

W

wdivmm (weighted divide matrix multiplication) / High-level operators are generated
weighted local neighborhood / Neighborhood method
Wikipedia dump
- downloading, for real-life Spark ML project / Downloading a complete dump of Wikipedia for a real-life Spark ML project
windowing / Windowing
Windows
- Hadoop runtime, configuring / Configuring Hadoop runtime on Windows
wine quality data
- downloading, for streaming / Downloading wine quality data for streaming regression
Within Set Sum of Squared Errors (WSSSE) / K-Means in practice, How to do it...
Word2Vec
- used, for displaying similar words with Spark / Displaying similar words with Spark using Word2Vec
- reference / There's more..., See also
World Wide Web (WWW) / Testing in a distributed environment
Write Ahead Log (WAL) / How transparent fault tolerance and exactly-once delivery guarantee is achieved

Z

ZeroMQ / Spark Streaming
zip() API
- used, for transforming RDDs / Transforming RDDs with the zip() API
