Index
A
- abstract syntax tree (AST) / High-level operators are generated
- ACM Digital Library
- reference / There's more...
- Alluxio / Hadoop Distributed File System
- alternating least squares (ALS)
- about / Introduction
- example / An example - alternating least squares
- Apache Giraph
- reference / Overview
- Apache Mesos / Apache Mesos
- Apache Spark
- cluster design / Cluster design
- windowing, improving / How Apache Spark improves windowing
- Apache Spark 2.0
- used, for running first program with IntelliJ IDE / Running your first program using Apache Spark 2.0 with the IntelliJ IDE
- Apache Spark GraphX
- overview / Overview
- Apache Spark GraphX module / Spark graph processing
- Apache SparkML pipelines
- components / The concept of pipelines
- Apache Spark V2
- changes / What's new in Apache Spark V2?
- Apache Spark V2.2
- unsupported operations / Increased performance with good old friends
- Apache Spark Streaming
- data sources / Overview
- Apache SystemML
- about / Spark machine learning, Why do we need just another library?
- history / The history of Apache SystemML
- performance measurements / Performance measurements
- working / Apache SystemML in action
- Apache SystemML architecture
- about / Apache SystemML architecture
- language parsing / Language parsing
- high-level operators, generating / High-level operators are generated
- low-level operators, optimizing / How low-level operators are optimized on
- Apache YARN / Apache YARN
- artificial neural networks (ANNs)
- about / Artificial neural networks
- working / ANN in practice
B
- basic statistical API, Spark
- used, for building algorithms / Spark's basic statistical API to help you build your own algorithms
- batch time / Overview
- binary classification model
- evaluating, Spark 2.0 used / Binary classification model evaluation using Spark 2.0
- Breeze graphics
- reference / There's more...
C
- C++ / Scala
- Cassandra / Hadoop Distributed File System
- Catalyst / There's more...
- Ceph / Hadoop Distributed File System
- checkpointing / Checkpointing
- classification
- with Naive Bayes / Classification with Naive Bayes
- classifiers / How it works...
- Cloud / Cloud
- cloud-based deployments / Cloud-based deployments
- Cloud9 library
- reference / How to do it...
- clustering
- with K-Means / Clustering with K-Means
- clustering systems / Introduction
- cluster management / Cluster management
- cluster manager options, Apache Spark
- local / Local
- standalone / Standalone
- cluster structure / The cluster structure
- coding / Coding
- collaborative filtering
- about / Collaborative filtering
- used, for building scalable recommendation engine / Building a scalable recommendation engine using collaborative filtering in Spark 2.0, There's more...
- common mistakes, Spark app development
- about / Common mistakes in Spark app development
- application failure / Application failure
- slow jobs / Slow jobs or unresponsiveness
- unresponsiveness / Slow jobs or unresponsiveness
- components, Apache SparkML pipelines
- DataFrame / The concept of pipelines
- transformer / The concept of pipelines
- estimator / The concept of pipelines, Estimators
- pipeline / The concept of pipelines, Transformers, Pipelines
- parameter / The concept of pipelines
- confusion matrix
- reference / There's more...
- content filtering / Content filtering
- continuous applications
- about / The concept of continuous applications
- unification / True unification - same code, same engine
- controlling / Controlling continuous applications
- continuous bag of words (CBOW) / How it works...
- cost-based optimizer, for machine learning algorithms
- about / A cost-based optimizer for machine learning algorithms
- alternating least squares, example / An example - alternating least squares
- count-based windows / How streaming engines use windowing
- CrossValidation / CrossValidation
D
- data
- normalizing, with Spark / Normalizing data with Spark
- streaming / Streaming data and debugging with queueStream
- Databricks
- reference / Dataset - a high-level unifying Data API
- data classification
- Gaussian Mixture, using / Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
- expectation maximization (EM), using / Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
- DataFrame / The concept of pipelines
- DataFrame-based machine learning API / Spark machine learning
- DataFrameReader / There's more...
- DataFrames
- about / Introduction, DataFrame - a natural evolution to unite API and SQL via a high-level API
- creating, from Scala data structures / Creating DataFrames from Scala data structures
- reference / See also
- programmatic operation, without SQL / Operating on DataFrames programmatically without SQL
- loading, from external source / Loading DataFrames and setup from an external source
- documentation reference / See also, See also
- using, with SparkSQL / Using DataFrames with standard SQL language - SparkSQL
- streaming, for real-time machine learning / Streaming DataFrames for real-time machine learning
- DataFrameWriter
- reference / There's more...
- data locality / Data locality
- Data Mining Group (DMG) / There's more...
- data serialization
- about / Data serialization
- Java serialization / Data serialization
- Kryo serialization / Data serialization
- dataset
- about / Dataset - a high-level unifying Data API
- strong type safety / Dataset - a high-level unifying Data API
- Tungsten Memory Management, enabling / Dataset - a high-level unifying Data API
- encoders / Dataset - a high-level unifying Data API
- catalyst optimizer friendly / Dataset - a high-level unifying Data API
- creating, from RDDs / Creating and using Datasets from RDDs and back again
- using, from RDDs / Creating and using Datasets from RDDs and back again
- streaming, for real-time machine learning / Streaming Datasets for real-time machine learning
- Dataset API
- working with, Scala Sequence used / Working with the Dataset API using a Scala Sequence
- used, for performing operations / Common operations with the new Dataset API
- reference / See also
- Dataset API and SQL
- used, for working with JSON / Working with JSON using the Dataset API and SQL together
- data sources, for practical machine learning
- identifying / Identifying data sources for practical machine learning
- data splitting
- for training / Splitting data for training and testing
- for testing / Splitting data for training and testing
- DataStreamReader
- reference / See also
- DataStreamWriter
- reference / See also
- datatypes
- documentation, reference / There's more...
- reference / There's more...
- debugging, Spark applications
- Spark Standalone / Debugging Spark applications using logs
- YARN / Debugging Spark applications using logs
- log4j, logging with / Logging with log4j with Spark, Logging with log4j with Spark recap
- about / Debugging Spark applications, Debugging the Spark application
- on Eclipse, as Scala debug / Debugging Spark application on Eclipse as Scala debug
- Spark jobs running as local and standalone mode, debugging / Debugging Spark jobs running as local and standalone mode
- on YARN or Mesos cluster / Debugging Spark applications on YARN or Mesos cluster
- SBT, using / Debugging Spark application using SBT
- Deeplearning4j
- about / Spark machine learning
- DenseVector API
- reference / See also
- dimensionality reduction systems / Introduction
- directed acyclic graph (DAG) / Why on Apache Spark?, Jobs
- Dirichlet
- reference / Topic modeling with Latent Dirichlet allocation in Spark 2.0
- Discretized Stream (DStream) / Introduction
- distributed environment, Spark
- testing in / Testing in a distributed environment
- about / Distributed environment
- issues / Issues in a distributed system
- software testing challenges / Challenges of software testing in a distributed environment
- domain objects
- used, for functional programming with Dataset API / Functional programming with the Dataset API using domain objects
- DSL (domain-specific language) / An example - alternating least squares
- dynamic rewrites / High-level operators are generated
E
- edges / Spark graph processing
- errors / Errors and recovery
- estimators
- about / The concept of pipelines
- RandomForestClassifier / RandomForestClassifier
- ETL (Extract Transform Load) / Naive Bayes in practice, Spark machine learning
- event time
- versus processing time / How Apache Spark improves windowing
- exactly-once delivery guarantee
- achieving / How transparent fault tolerance and exactly-once delivery guarantee is achieved, State versioning guarantees consistent results after reruns
- expectation maximization (EM)
- used, for data classification / Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
- extended ecosystem / Extended ecosystem
- external data sources
- used, for creating RDDs / Creating RDDs with Spark 2.0 using external data sources
F
- Facebook / Overview
- features / VectorAssembler
- FIFO (first-in first-out) scheduling / Standalone
- file streams / File streams
- filter() API
- used, for transforming RDDs / Transforming RDDs with Spark 2.0 using the filter() API
- reference / See also
- firing mechanism (F(Net)) / Artificial neural networks
- first-in first-out (FIFO) / Example - connection to a MQTT message broker
- flatMap() API
- used, for transforming RDDs / Transforming RDDs with the super useful flatMap() API
- Flume
- about / Flume
- reference / Flume
- working / Flume
- functional programming, with Dataset API
- domain objects, using / Functional programming with the Dataset API using domain objects
G
- garbage collection (GC) / What's new in Apache Spark V2?
- Gaussian Mixture
- used, for data classification / Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
- reference / See also
- GaussianMixtureModel
- reference / See also
- General Electric (GE) / How it works...
- GlusterFS / Hadoop Distributed File System
- GPFS (General Parallel File System) / Hadoop Distributed File System
- graph / Spark graph processing, Overview
- graph analytics/processing, with GraphX
- about / Graph analytics/processing with GraphX
- raw data / The raw data
- graph, creating / Creating a graph
- counting example / Example 1 – counting
- filtering example / Example 2 – filtering
- PageRank example / Example 3 – PageRank
- triangle counting example / Example 4 – triangle counting
- connected components example / Example 5 – connected components
- graphics
- adding, to Spark program / How to add graphics to your Spark program
- graph processing / Spark graph processing
- groupBy() method
- used, for RDD transformation/aggregation / RDD transformation/aggregation with groupBy() and reduceByKey()
- reference / There's more...
- Guava
- about / There's more...
- reference / There's more...
H
- Hadoop / The development environment, Apache Spark
- Hadoop Distributed File System / Hadoop Distributed File System
- Hadoop Distributed File System (HDFS)
- reference / The development environment
- Hadoop runtime
- configuring, on Windows / Configuring Hadoop runtime on Windows
- HeartbeatReceiver RPC endpoint / Executors
- hierarchical clustering approaches
- divisive / How it works...
- agglomerative / How it works...
- reference / See also
- high-level operators (HOPs) / High-level operators are generated
- hyperparameters / The concept of pipelines
- hyperparameter tuning / Hyperparameter tuning
I
- IDEA documentation
- reference / Debugging Spark application using SBT
- IEEE Digital Library
- reference / There's more...
- implicit input, for training
- dealing with / Dealing with implicit input for training
- Infrastructure as a Service (IaaS) / Cloud-based deployments
- inputs / ANN in practice
- IntelliJ
- configuration, for working with Spark / Configuring IntelliJ to work with Spark and run Spark ML sample codes
- IntelliJ IDE
- Apache Spark 2.0, used for running first program / Running your first program using Apache Spark 2.0 with the IntelliJ IDE
- internal data sources
- used, for creating RDDs / Creating RDDs with Spark 2.0 using internal data sources
- Internet of Things (IoT) / Example - connection to a MQTT message broker
- Iris data
- downloading, for unsupervised classification / Downloading and understanding the famous Iris data for unsupervised classification
J
- Java / Scala
- JavaScript Object Notation (JSON)
- working with, Dataset API and SQL used / Working with JSON using the Dataset API and SQL together
- about / How it works...
- Java Virtual Machine (JVM) / What's new in Apache Spark V2?, Hadoop Distributed File System
- JBOD (just a bunch of disks) approach / Cluster design
- JFreeChart
- reference / See also
- JFreeChart JAR files
- reference / How to do it...
- JMLR
- reference / There's more...
K
- K-Means
- working / K-Means in practice
- KMeans streaming
- for real-time on-line classifier / Streaming KMeans for a real-time on-line classifier
- Kafka
- about / Kafka
- reference / Kafka
- using / Kafka
- Kaggle competition, winning with Apache SparkML
- about / Winning a Kaggle competition with Apache SparkML
- data preparation / Data preparation
- feature engineering / Feature engineering
- feature engineering pipeline, testing / Testing the feature engineering pipeline
- machine learning model, training / Training the machine learning model
- model evaluation / Model evaluation
- CrossValidation / CrossValidation and hyperparameter tuning
- hyperparameter tuning / CrossValidation and hyperparameter tuning
- evaluator, used for assessing quality of cross-validated model / Using the evaluator to assess the quality of the cross-validated and tuned model
- Kaggle competitions
- reference / There's more...
- KMeans
- bisecting / Bisecting KMeans, the new kid on the block in Spark 2.0
- bisecting, reference / There's more...
- streaming, for data classification / Streaming KMeans to classify data in near real-time
- streaming, reference / See also
- KMeans() object
- reference / See also
- KMeans classifying system
- building, in Spark 2.0 / Building a KMeans classifying system in Spark 2.0, How it works...
- KMeans (Lloyd Algorithm) / KMeans (Lloyd Algorithm)
- KMeans++ (Arthur's algorithm) / KMeans++ (Arthur's algorithm)
- KMeans|| (pronounced as KMeans Parallel) / KMeans|| (pronounced as KMeans Parallel)
- KMeansModel() object
- reference / See also
- Kolmogorov-Smirnov (KS) / There's more...
L
- LabeledPoint data structure
- for Spark ML / LabeledPoint data structure for Spark ML
- reference / See also
- last-in first-out (LIFO) / Example - connection to a MQTT message broker
- late data / How Apache Spark improves windowing
- Latent Dirichlet Allocation (LDA)
- used, for classifying documents and text into topics / Latent Dirichlet Allocation (LDA) to classify documents and text into topics, See also
- about / Introduction
- latent factor models techniques
- Singular Value Decomposition (SVD) / Latent factor models techniques
- Stochastic Gradient Descent (SGD) / Latent factor models techniques
- Alternating Least Squares (ALS) / Latent factor models techniques
- latent factors / Building a scalable recommendation engine using collaborative filtering in Spark 2.0
- Latent Semantic Analysis (LSA)
- used, for text analytics with Spark 2.0 / Using Latent Semantic Analysis for text analytics with Spark 2.0
- LDAModel
- reference / See also
- libraries / Software versions and libraries used in this book
- linear regression
- streaming, for real-time regression / Streaming linear regression for a real-time regression
- Lustre / Hadoop Distributed File System
M
- machine learning
- about / Introduction, Machine learning
- data sources / See also
- machine learning library (MLlib) / Apache Spark
- MapR file system / Hadoop Distributed File System
- Maven-based build
- reference / Method 3: Making life easier with Spark testing base
- Maven repository
- Spark installation, reference / See also
- Mean Squared Error (MSE) / Regression model evaluation using Spark 2.0
- memory tuning
- about / Memory tuning
- memory usage / Memory usage and management
- memory management / Memory usage and management
- data structures, tuning / Tuning the data structures
- serialized RDD storage / Serialized RDD storage
- garbage collection tuning / Garbage collection tuning
- level of parallelism / Level of parallelism
- broadcasting / Broadcasting
- data locality / Data locality
- metrics
- reference / There's more...
- MinMaxScaler
- reference / See also
- ML pipelines
- creating, for real-life machine learning applications / ML pipelines for real-life machine learning applications
- model evaluation / Model evaluation
- model export facility
- exploring / New model export and PMML markup in Spark 2.0
- MovieLens dataset
- reference / How it works...
- MQTT / Spark Streaming
- MQTT (Message Queue Telemetry Transport) / Example - connection to a MQTT message broker
- MQTT message broker connection
- example / Example - connection to a MQTT message broker
- multiclass classification metrics
- reference / See also
- multiclass classification model
- evaluating, Spark 2.0 used / Multiclass classification model evaluation using Spark 2.0
- multilabel classification model
- evaluating, Spark 2.0 used / Multilabel classification model evaluation using Spark 2.0
- multilabel metrics
- reference / There's more...
- multivariate statistical summary
- reference / See also, See also
N
- Naive Bayes
- using / Theory on Classification
- working / Naive Bayes in practice
- netcat
- reference / TCP stream
- Net function / Artificial neural networks
- Neural Net (NN) / How it works...
- New GaussianMixture() parameter / New GaussianMixture()
- nodes / Spark graph processing
O
- OneHotEncoder / OneHotEncoder
- OOM (Out of Memory) messages
- avoiding / Memory
- optimization techniques
- about / Optimization techniques
- data serialization / Data serialization
- memory tuning / Memory tuning
- Out Of Memory (OOM) / Common mistakes in Spark app development
P
- paired key-value RDDs
- used, for join transformation / Join transformation with paired key-value RDDs
- used, for reducing transformation / Reduce and grouping transformation with paired key-value RDDs
- used, for grouping transformation / Reduce and grouping transformation with paired key-value RDDs
- parameters / The concept of pipelines
- partitions / RDDs - what started it all...
- pattern matching / There's more...
- performance / Performance
- performance-related problems, Spark
- reference / Cloud
- PIC (Power Iteration Clustering) / How it works...
- Pima Diabetes data
- downloading, for supervised classification / Downloading Pima Diabetes data for supervised classification
- pipelines / What does the new API look like?, The concept of pipelines, Pipelines
- Platform as a Service (PaaS) / Cloud-based deployments
- PMMLExportable API
- reference / See also
- PowerIterationClustering() constructor
- reference / See also
- Power Iteration Clustering (PIC)
- used, for classifying graph vertices / Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
- PowerIterationClusteringModel() constructor
- reference / See also
- practical machine learning, with Spark
- Scala, using / There's more...
- Predictive Model Markup Language (PMML)
- using / New model export and PMML markup in Spark 2.0
- Priority Queue / Example - connection to a MQTT message broker
- processing time
- versus event time / How Apache Spark improves windowing
Q
- Quantcast / Hadoop Distributed File System
- quasiquotes / There's more...
- queueStream
- used, for debugging / Streaming data and debugging with queueStream
- reference / How it works...
R
- R
- reference / How it works...
- RandomForestClassifier / RandomForestClassifier
- randomSplit()
- reference / See also
- RankingMetrics API
- documentation link / There's more...
- RDDs
- about / RDDs - what started it all...
- JdbcRDD / RDDs - what started it all...
- VertexRDD / RDDs - what started it all...
- HadoopRDD / RDDs - what started it all...
- UnionRDD / RDDs - what started it all...
- RandomRDD / RDDs - what started it all...
- creating, internal data sources used / Creating RDDs with Spark 2.0 using internal data sources
- creating, external data sources used / Creating RDDs with Spark 2.0 using external data sources
- transforming, filter() API used / Transforming RDDs with Spark 2.0 using the filter() API
- transforming, flatMap() API / Transforming RDDs with the super useful flatMap() API
- transforming, with set operation APIs / Transforming RDDs with set operation APIs
- transforming, with zip() API / Transforming RDDs with the zip() API
- datasets, creating / Creating and using Datasets from RDDs and back again
- datasets, using / Creating and using Datasets from RDDs and back again
- versus DataFrame / Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
- versus Dataset from text file / Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
- documentation, reference / See also
- real-life machine learning applications
- ML pipelines, creating / ML pipelines for real-life machine learning applications
- real-life Spark ML project
- dump of Wikipedia, downloading / Downloading a complete dump of Wikipedia for a real-life Spark ML project
- real-time machine learning
- structured streaming / Structured streaming for near real-time machine learning
- DataFrames, streaming / Streaming DataFrames for real-time machine learning
- Datasets, streaming / Streaming Datasets for real-time machine learning
- real-time on-line classifier
- KMeans streaming / Streaming KMeans for a real-time on-line classifier
- real-time regression
- linear regression, streaming / Streaming linear regression for a real-time regression
- recommendation engines / Introduction
- recommendation system
- about / Introduction
- movie data details, exploring / Exploring the movies data details for the recommendation system in Spark 2.0
- ratings data details, exploring / Exploring the ratings data details for the recommendation system in Spark 2.0, There's more...
- recovery / Errors and recovery
- reduceByKey() method
- used, for RDD transformation/aggregation / RDD transformation/aggregation with groupBy() and reduceByKey()
- RegressionMetrics facility / Regression model evaluation using Spark 2.0
- relationships / Spark graph processing
- Resilient Distributed Datasets (RDDs) / Machine learning
S
- sample ML code
- running, from Spark / Running a sample ML code from Spark
- sbt tool / The development environment
- Scala / The development environment, Introduction, Scala
- scalable recommendation engine
- required data, setting up / Setting up the required data for a scalable recommendation engine in Spark 2.0
- building, with collaborative filtering / Building a scalable recommendation engine using collaborative filtering in Spark 2.0, There's more...
- Scala Breeze library
- used, for creating graphics in Spark 2.0 / Using the Scala Breeze library to do graphics in Spark 2.0
- Scala data structures
- DataFrames, creating / Creating DataFrames from Scala data structures
- Scala pattern matching
- reference / There's more...
- Scala quasiquotes
- reference / There's more...
- Scala Sequence
- used, for working with Dataset API / Working with the Dataset API using a Scala Sequence
- ScalaTest's assertions
- reference / Testing Scala methods
- Scala test guideline
- reference / Testing Scala methods
- scikit-learn
- reference / There's more...
- session-based windows / How streaming engines use windowing
- set operation APIs
- used, for transforming RDDs / Transforming RDDs with set operation APIs
- Singular Value Decomposition (SVD)
- reference / See also
- skip-gram model with negative sampling (SGNS) / There's more...
- sliding windows / How streaming engines use windowing
- Software as a Service (SaaS) / Cloud-based deployments
- software versions / Software versions and libraries used in this book
- Spark
- testing, in distributed environment / Testing in a distributed environment
- reference / Software versions and libraries used in this book
- sample ML code, running / Running a sample ML code from Spark
- download link / There's more...
- used, for normalizing data / Normalizing data with Spark
- tools / Introduction
- term frequency, doing / Doing term frequency with Spark - everything that counts
- used, for displaying similar words / Displaying similar words with Spark using Word2Vec
- Spark 1.6 streaming
- reference / There's more...
- Spark 2.0
- access to SparkContext vis-a-vis SparkSession object, obtaining / Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0
- regression model, evaluating / Regression model evaluation using Spark 2.0
- used, for multiclass classification model evaluation / Multiclass classification model evaluation using Spark 2.0
- used, for multilabel classification model evaluation / Multilabel classification model evaluation using Spark 2.0
- Scala Breeze library, used for creating graphics / Using the Scala Breeze library to do graphics in Spark 2.0
- KMeans classifying system, building / Building a KMeans classifying system in Spark 2.0
- KMeans, bisecting / Bisecting KMeans, the new kid on the block in Spark 2.0
- Latent Semantic Analysis, used for text analytics / Using Latent Semantic Analysis for text analytics with Spark 2.0
- topic modeling, with Latent Dirichlet allocation / Topic modeling with Latent Dirichlet allocation in Spark 2.0
- Spark 2.0 ML documentation
- reference / See also
- Spark 2.0 MLlib
- documentation link / See also
- Spark 2.0+
- Spark cluster, accessing / Getting access to Spark cluster in Spark 2.0
- Spark applications
- visualizing, web UI used / Visualizing Spark application using web UI
- running, observing / Observing the running and completed Spark jobs
- completed Spark jobs, observing / Observing the running and completed Spark jobs
- debugging, logs used / Debugging Spark applications using logs
- testing / Testing Spark applications, Testing Spark applications
- Scala methods, testing / Testing Scala methods
- unit testing / Unit testing
- testing, with Scala JUnit test / Method 1: Using Scala JUnit test
- Scala code, testing with FunSuite / Method 2: Testing Scala code using FunSuite
- Spark testing base / Method 3: Making life easier with Spark testing base
- debugging / Debugging Spark applications
- Spark cluster
- accessing, in Spark 2.0+ / Getting access to Spark cluster in Spark 2.0
- Spark cluster pre-Spark 2.0
- access, obtaining / Getting access to Spark cluster pre-Spark 2.0
- Spark configuration
- about / Spark configuration
- Spark properties / Spark properties
- environment variables / Environmental variables
- logging / Logging
- SparkContext
- documentation reference / See also
- reference / See also
- SparkContext vis-a-vis SparkSession object
- access, obtaining / Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0
- Spark graph processing / Spark graph processing
- Spark jobs, monitoring
- about / Monitoring Spark jobs
- Spark web interface / Spark web interface
- Spark machine learning / Spark machine learning
- SparkML / Spark machine learning
- Spark ML
- LabeledPoint data structure / LabeledPoint data structure for Spark ML
- SparkML API / What does the new API look like?
- Spark MLlib
- architecture / Architecture
- development environment / The development environment
- Spark ML sample codes
- running / Configuring IntelliJ to work with Spark and run Spark ML sample codes
- Spark program
- graphics, adding / How to add graphics to your Spark program
- SparkSession
- reference / See also
- SparkSQL
- DataFrames, using / How to do it...
- Spark SQL / Spark SQL
- Spark Streaming Context (SSC) / Overview
- Spark streaming
- about / Introduction
- reference / There's more...
- Spark Streaming / Spark Streaming
- Spark testing base
- reference / Method 3: Making life easier with Spark testing base
- Spark web interface
- about / Spark web interface
- Jobs / Jobs
- Stages / Stages
- Storage / Storage
- Environment / Environment
- Executors / Executors
- SQL / SQL
- sparse matrix / An example - alternating least squares
- SparseVector API
- reference / See also
- sparse vector representations / Feature engineering
- specialized datasets
- reference / See also
- static rewrites / High-level operators are generated
- stemming
- reference / How it works...
- streaming engines
- windowing, using / How streaming engines use windowing
- streaming regression
- wine quality data, downloading / Downloading wine quality data for streaming regression
- Streaming sources
- about / Streaming sources
- TCP stream / TCP stream
- file streams / File streams
- Flume / Flume
- Kafka / Kafka
- stream life cycle management / More on stream life cycle management
- stream processing / Spark Streaming
- string indexer / String indexer
- structured streaming
- for near real-time machine learning / Structured streaming for near real-time machine learning
- reference / See also
- supervised classification
- Pima Diabetes data, downloading / Downloading Pima Diabetes data for supervised classification
- Support Vector Machine (SVM)
- about / Normalizing data with Spark
T
- Tachyon / Hadoop Distributed File System
- TCP stream / TCP stream
- TDD (test-driven development) / Testing Scala methods
- text analysis / Introduction
- time-based windows / How streaming engines use windowing
- TinkerPop
- reference / Overview
- topic modeling
- with Latent Dirichlet allocation / Topic modeling with Latent Dirichlet allocation in Spark 2.0
- transformers
- about / The concept of pipelines, Transformers
- string indexer / String indexer
- OneHotEncoder / OneHotEncoder
- VectorAssembler / VectorAssembler
- transparent fault tolerance
- achieving / How transparent fault tolerance and exactly-once delivery guarantee is achieved, Idempotent sinks prevent data duplication
- tumbling windows / How streaming engines use windowing
U
- UDF (user-defined function) / Memory usage and management
- unit testing
- Spark applications / Unit testing
- unit vectors
- reference / There's more...
- unsupervised classification
- Iris data, downloading / Downloading and understanding the famous Iris data for unsupervised classification
- unsupervised learning / Introduction
V
- VectorAssembler / VectorAssembler
- vertices, graph
- classification, with Power Iteration Clustering (PIC) / Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
- Virtual Machine (VM) / Challenges of software testing in a distributed environment
W
- wdivmm (weighted divide matrix multiplication) / High-level operators are generated
- weighted local neighborhood / Neighborhood method
- Wikipedia dump
- downloading, for real-life Spark ML project / Downloading a complete dump of Wikipedia for a real-life Spark ML project
- windowing / Windowing
- Windows
- Hadoop runtime, configuring / Configuring Hadoop runtime on Windows
- wine quality data
- downloading, for streaming regression / Downloading wine quality data for streaming regression
- Within Set Sum of Squared Errors (WSSSE) / K-Means in practice, How to do it...
- Word2Vec
- used, for displaying similar words with Spark / Displaying similar words with Spark using Word2Vec
- reference / There's more..., See also
- World Wide Web (WWW) / Testing in a distributed environment
- Write Ahead Log (WAL) / How transparent fault tolerance and exactly-once delivery guarantee is achieved
Z
- ZeroMQ / Spark Streaming
- zip() API
- used, for transforming RDDs / Transforming RDDs with the zip() API