Apache Spark 2: Data Processing and Real-Time Analytics

Master complex big data processing, stream analytics, and machine learning with Apache Spark

Product type: Course
Published: Dec 2018
Publisher: Packt
ISBN-13: 9781789959208
Length: 616 pages
Edition: 1st Edition
Authors (7):
Sridhar Alla
Romeo Kienzler
Siamak Amirghodsi
Broderick Hall
Md. Rezaul Karim
Meenakshi Rajendran
Shuen Mei
Table of Contents (23 chapters)

Title Page
Copyright
About Packt
Contributors
Preface
1. A First Taste and What's New in Apache Spark V2
2. Apache Spark Streaming
3. Structured Streaming
4. Apache Spark MLlib
5. Apache SparkML
6. Apache SystemML
7. Apache Spark GraphX
8. Spark Tuning
9. Testing and Debugging Spark
10. Practical Machine Learning with Spark Using Scala
11. Spark's Three Data Musketeers for Machine Learning - Perfect Together
12. Common Recipes for Implementing a Robust Machine Learning System
13. Recommendation Engine that Scales with Spark
14. Unsupervised Clustering with Apache Spark 2.0
15. Implementing Text Analytics with Spark 2.0 ML Library
16. Spark Streaming and Machine Learning Library
Other Books You May Enjoy
Index

Index

A

  • abstract syntax tree (AST) / High-level operators are generated
  • ACM Digital Library
    • reference / There's more...
  • Alluxio / Hadoop Distributed File System
  • alternating least squares (ALS) / Introduction, An example - alternating least squares
  • Apache Giraph
    • reference / Overview
  • Apache Mesos / Apache Mesos
  • Apache Spark / Introduction, Apache Spark
    • cluster design / Cluster design
    • windowing, improving / How Apache Spark improves windowing
  • Apache Spark 2.0
    • used, for running first program with IntelliJ IDE / Running your first program using Apache Spark 2.0 with the IntelliJ IDE
  • Apache Spark GraphX
    • overview / Overview
  • Apache Spark GraphX module / Spark graph processing
  • Apache SparkML pipelines
    • components / The concept of pipelines
  • Apache Spark V2
    • changes / What's new in Apache Spark V2?
  • Apache Spark V2.2
    • unsupported operations / Increased performance with good old friends
  • Apache Streaming
    • data sources / Overview
  • Apache SystemML
    • about / Spark machine learning, Why do we need just another library?
    • history / The history of Apache SystemML
    • performance measurements / Performance measurements
    • working / Apache SystemML in action
  • Apache SystemML architecture
    • about / ApacheSystemML architecture
    • language parsing / Language parsing
    • high-level operators, generating / High-level operators are generated
    • low-level operators, optimizing / How low-level operators are optimized on
  • Apache YARN / Apache YARN
  • artificial neural networks (ANNs)
    • about / Artificial neural networks
    • working / ANN in practice

B

  • basic statistical API, Spark
    • used, for building algorithms / Spark's basic statistical API to help you build your own algorithms
  • batch time / Overview
  • binary classification model
    • evaluating, Spark 2.0 used / Binary classification model evaluation using Spark 2.0
  • Breeze graphics
    • reference / There's more...

C

  • C++ / Scala
  • Cassandra / Hadoop Distributed File System
  • Catalyst / There's more...
  • Ceph / Hadoop Distributed File System
  • checkpointing / Checkpointing
  • classification
    • with Naive Bayes / Classification with Naive Bayes
  • classifiers / How it works...
  • Cloud / Cloud
  • cloud-based deployments / Cloud-based deployments
  • Cloud9 library
    • reference / How to do it...
  • clustering
    • with K-means / Clustering with K-Means
  • clustering systems / Introduction
  • cluster management / Cluster management
  • cluster manager options, Apache Spark
    • local / Local
    • standalone / Standalone
  • cluster structure / The cluster structure
  • coding / Coding
  • collaborative filtering
    • about / Collaborative filtering
    • used, for building scalable recommendation engine / Building a scalable recommendation engine using collaborative filtering in Spark 2.0, There's more...
  • common mistakes, Spark app development
    • about / Common mistakes in Spark app development
    • application failure / Application failure
    • slow jobs / Slow jobs or unresponsiveness
    • unresponsiveness / Slow jobs or unresponsiveness
  • components, Apache SparkML pipelines
    • DataFrame / The concept of pipelines
    • transformer / The concept of pipelines
    • estimator / The concept of pipelines, Estimators
    • pipeline / The concept of pipelines, Transformers, Pipelines
    • parameter / The concept of pipelines
  • confusion matrix
    • reference / There's more...
  • content filtering / Content filtering
  • continuous applications
    • about / The concept of continuous applications
    • unification / True unification - same code, same engine
    • controlling / Controlling continuous applications
  • continuous bag of words (CBOW) / How it works...
  • cost-based optimizer, for machine learning algorithms
    • about / A cost-based optimizer for machine learning algorithms
    • alternating least squares, example / An example - alternating least squares
  • count-based windows / How streaming engines use windowing
  • CrossValidation / CrossValidation

D

  • data
    • normalizing, with Spark / Normalizing data with Spark
    • streaming / Streaming data and debugging with queueStream
  • Databricks
    • reference / Dataset - a high-level unifying Data API
  • data classification
    • Gaussian Mixture, using / Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
    • expectation maximization (EM), using / Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
  • DataFrame / The concept of pipelines
  • DataFrame-based machine learning API / Spark machine learning
  • DataFrameReader / There's more...
  • DataFrames
    • about / Introduction, DataFrame - a natural evolution to unite API and SQL via a high-level API
    • creating, from Scala data structures / Creating DataFrames from Scala data structures
    • reference / See also
    • programmatic operation, without SQL / Operating on DataFrames programmatically without SQL
    • loading, from external source / Loading DataFrames and setup from an external source
    • documentation reference / See also, See also
    • using, with SparkSQL / Using DataFrames with standard SQL language - SparkSQL
    • streaming, for real-time machine learning / Streaming DataFrames for real-time machine learning
  • DataFrameWriter
    • reference / There's more...
  • data locality / Data locality
  • Data Mining Group (DMG) / There's more...
  • data serialization
    • about / Data serialization
    • Java serialization / Data serialization
    • Kryo serialization / Data serialization
  • dataset
    • about / Dataset - a high-level unifying Data API
    • strong type safety / Dataset - a high-level unifying Data API
    • Tungsten Memory Management, enabling / Dataset - a high-level unifying Data API
    • encoders / Dataset - a high-level unifying Data API
    • catalyst optimizer friendly / Dataset - a high-level unifying Data API
    • creating, from RDDs / Creating and using Datasets from RDDs and back again
    • using, from RDDs / Creating and using Datasets from RDDs and back again
    • streaming, for real-time machine learning / Streaming Datasets for real-time machine learning
  • Dataset API
    • working with, Scala Sequence used / Working with the Dataset API using a Scala Sequence
    • used, for performing operations / Common operations with the new Dataset API
    • reference / See also
  • Dataset API and SQL
    • used, for working with JSON / Working with JSON using the Dataset API and SQL together
  • data sources, for practical machine learning
    • identifying / Identifying data sources for practical machine learning
  • data splitting
    • for training / Splitting data for training and testing
    • for testing / Splitting data for training and testing
  • DataStreamReader
    • reference / See also
  • DataStreamWriter
    • reference / See also
  • datatypes
    • documentation, reference / There's more...
    • reference / There's more...
  • debugging, Spark applications
    • Spark Standalone / Debugging Spark applications using logs
    • YARN / Debugging Spark applications using logs
    • log4j, logging with / Logging with log4j with Spark, Logging with log4j with Spark recap
    • about / Debugging Spark applications, Debugging the Spark application
    • on Eclipse, as Scala debug / Debugging Spark application on Eclipse as Scala debug
    • Spark jobs running as local and standalone mode, debugging / Debugging Spark jobs running as local and standalone mode
    • on YARN or Mesos cluster / Debugging Spark applications on YARN or Mesos cluster
    • SBT, using / Debugging Spark application using SBT
  • Deeplearning4j
    • about / Spark machine learning
  • DenseVector API
    • reference / See also
  • dimensionality reduction systems / Introduction
  • directed acyclic graph (DAG) / Why on Apache Spark?, Jobs
  • Dirichlet
    • reference / Topic modeling with Latent Dirichlet allocation in Spark 2.0
  • Discretized Stream (DStream) / Introduction
  • distributed environment, Spark
    • testing in / Testing in a distributed environment
    • about / Distributed environment
    • issues / Issues in a distributed system
    • software testing challenges / Challenges of software testing in a distributed environment
  • domain objects
    • used, for functional programming with Dataset API / Functional programming with the Dataset API using domain objects
  • DSL (domain specific language) / An example - alternating least squares
  • dynamic rewrites / High-level operators are generated

E

  • edges / Spark graph processing
  • errors / Errors and recovery
  • estimators
    • about / The concept of pipelines
    • RandomForestClassifier / RandomForestClassifier
  • ETL (Extract Transform Load) / Naive Bayes in practice
  • event time
    • versus processing time / How Apache Spark improves windowing
  • exactly-once delivery guarantee
    • achieving / How transparent fault tolerance and exactly-once delivery guarantee is achieved, State versioning guarantees consistent results after reruns
  • expectation maximization (EM)
    • used, for data classification / Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
  • extended ecosystem / Extended ecosystem
  • external data sources
    • used, for creating RDDs / Creating RDDs with Spark 2.0 using external data sources
  • Extract Transform Load (ETL) / Spark machine learning

F

  • Facebook / Overview
  • features / VectorAssembler
  • FIFO (first-in first-out) scheduling / Standalone
  • file streams / File streams
  • filter() API
    • used, for transforming RDDs / Transforming RDDs with Spark 2.0 using the filter() API
    • reference / See also
  • firing mechanism (F(Net)) / Artificial neural networks
  • first-in first-out (FIFO) / Example - connection to a MQTT message broker
  • flatMap() API
    • used, for transforming RDDs / Transforming RDDs with the super useful flatMap() API
  • Flume
    • about / Flume
    • reference / Flume
    • working / Flume
  • functional programming, with Dataset API
    • domain objects, using / Functional programming with the Dataset API using domain objects

G

  • garbage collection (GC) / What's new in Apache Spark V2?
  • Gaussian Mixture
    • used, for data classification / Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
    • reference / See also
  • GaussianMixtureModel
    • reference / See also
  • General Electric (GE) / How it works...
  • GlusterFS / Hadoop Distributed File System
  • GPFS (General Parallel File System) / Hadoop Distributed File System
  • graph / Spark graph processing, Overview
  • graph analytics/processing, with GraphX
    • about / Graph analytics/processing with GraphX
    • raw data / The raw data
    • graph, creating / Creating a graph
    • counting example / Example 1 – counting
    • filtering example / Example 2 – filtering
    • PageRank example / Example 3 – PageRank
    • triangle counting example / Example 4 – triangle counting
    • connected components example / Example 5 – connected components
  • graphics
    • adding, to Spark program / How to add graphics to your Spark program
  • graph processing / Spark graph processing
  • groupBy() method
    • used, for RDD transformation/aggregation / RDD transformation/aggregation with groupBy() and reduceByKey()
    • reference / There's more...
  • Guava
    • about / There's more...
    • reference / There's more...

H

  • Hadoop / The development environment, Apache Spark
  • Hadoop Distributed File System / Hadoop Distributed File System
  • Hadoop Distributed File System (HDFS)
    • reference / The development environment
  • Hadoop runtime
    • configuring, on Windows / Configuring Hadoop runtime on Windows
  • HeartbeatReceiver RPC endpoint / Executors
  • hierarchical clustering approaches
    • divisive / How it works...
    • agglomerative / How it works...
    • reference / See also
  • high-level operators (HOPs) / High-level operators are generated
  • hyperparameters / The concept of pipelines
  • hyperparameter tuning / Hyperparameter tuning

I

  • IDEA documentation
    • reference / Debugging Spark application using SBT
  • IEEE Digital Library
    • reference / There's more...
  • implicit input, for training
    • dealing with / Dealing with implicit input for training
  • Infrastructure as a Service (IaaS) / Cloud-based deployments
  • inputs / ANN in practice
  • IntelliJ
    • configuration, for working with Spark / Configuring IntelliJ to work with Spark and run Spark ML sample codes
  • IntelliJ IDE
    • Apache Spark 2.0, used for running first program / Running your first program using Apache Spark 2.0 with the IntelliJ IDE
  • internal data sources
    • used, for creating RDDs / Creating RDDs with Spark 2.0 using internal data sources
  • Internet of Things (IoT) / Example - connection to a MQTT message broker
  • Iris data
    • downloading, for unsupervised classification / Downloading and understanding the famous Iris data for unsupervised classification

J

  • Java / Scala
  • JavaScript Object Notation (JSON)
    • working with, Dataset API and SQL used / Working with JSON using the Dataset API and SQL together
    • about / How it works...
  • Java Virtual Machine (JVM) / What's new in Apache Spark V2?, Hadoop Distributed File System
  • JBOD (just a bunch of disks) approach / Cluster design
  • JFreeChart
    • reference / See also
  • JFreeChart JAR files
    • reference / How to do it...
  • JMLR
    • reference / There's more...

K

  • K-Means
    • working / K-Means in practice
  • k-means streaming
    • for real-time on-line classifier / Streaming KMeans for a real-time on-line classifier
  • Kafka
    • about / Kafka
    • reference / Kafka
    • using / Kafka
  • Kaggle competition, winning with Apache SparkML
    • about / Winning a Kaggle competition with Apache SparkML
    • data preparation / Data preparation
    • feature engineering / Feature engineering
    • feature engineering pipeline, testing / Testing the feature engineering pipeline
    • machine learning model, training / Training the machine learning model
    • model evaluation / Model evaluation
    • CrossValidation / CrossValidation and hyperparameter tuning
    • hyperparameter tuning / CrossValidation and hyperparameter tuning
    • evaluator, used for assessing quality of cross-validated model / Using the evaluator to assess the quality of the cross-validated and tuned model
  • Kaggle competitions
    • reference / There's more...
  • KMeans
    • bisecting / Bisecting KMeans, the new kid on the block in Spark 2.0
    • bisecting, reference / There's more...
    • streaming, for data classification / Streaming KMeans to classify data in near real-time
    • streaming, reference / See also
  • KMeans() object
    • reference / See also
  • KMeans classifying system
    • building, in Spark 2.0 / Building a KMeans classifying system in Spark 2.0, How it works...
    • KMeans (Lloyd Algorithm) / KMeans (Lloyd Algorithm)
    • K-Means++ (Arthur's Algorithm) / KMeans++ (Arthur's algorithm)
    • K-Means|| (pronounced as K-Means Parallel) / KMeans|| (pronounced as KMeans Parallel)
  • KMeansModel() object
    • reference / See also
  • Kolmogorov-Smirnov (KS) / There's more...

L

  • LabeledPoint data structure
    • for Spark ML / LabeledPoint data structure for Spark ML
    • reference / See also
  • last-in first-out (LIFO) / Example - connection to a MQTT message broker
  • late data / How Apache Spark improves windowing
  • Latent Dirichlet Allocation (LDA)
    • used, for classifying documents and text into topics / Latent Dirichlet Allocation (LDA) to classify documents and text into topics, See also
    • about / Introduction
  • latent factor models techniques
    • Singular Value Decomposition (SVD) / Latent factor models techniques
    • Stochastic Gradient Descent (SGD) / Latent factor models techniques
    • Alternating Least Squares (ALS) / Latent factor models techniques
  • latent factors / Building a scalable recommendation engine using collaborative filtering in Spark 2.0
  • Latent Semantic Analysis (LSA)
    • used, for text analytics with Spark 2.0 / Using Latent Semantic Analysis for text analytics with Spark 2.0
  • LDAModel
    • reference / See also
  • libraries / Software versions and libraries used in this book
  • linear regression
    • streaming, for real-time regression / Streaming linear regression for a real-time regression
  • Lustre / Hadoop Distributed File System

M

  • machine learning
    • about / Introduction, Machine learning
    • data sources / See also
  • machine learning library (MLlib) / Apache Spark
  • MapR file system / Hadoop Distributed File System
  • Maven-based build
    • reference / Method 3: Making life easier with Spark testing base
  • Maven repository
    • Spark installation, reference / See also
  • Mean Squared Error (MSE) / Regression model evaluation using Spark 2.0
  • memory tuning
    • about / Memory tuning
    • memory usage / Memory usage and management
    • memory management / Memory usage and management
    • data structures, tuning / Tuning the data structures
    • serialized RDD storage / Serialized RDD storage
    • garbage collection tuning / Garbage collection tuning
    • level of parallelism / Level of parallelism
    • broadcasting / Broadcasting
    • data locality / Data locality
  • metrics
    • reference / There's more...
  • MinMaxScaler
    • reference / See also
  • ML pipelines
    • creating, for real-life machine learning applications / ML pipelines for real-life machine learning applications
  • model evaluation / Model evaluation
  • model export facility
    • exploring / New model export and PMML markup in Spark 2.0
  • MovieLens dataset
    • reference / How it works...
  • MQTT / Spark Streaming
  • MQTT (Message Queuing Telemetry Transport) / Example - connection to a MQTT message broker
  • MQTT message broker connection
    • example / Example - connection to a MQTT message broker
  • multiclass classification metrics
    • reference / See also
  • multiclass classification model
    • evaluating, Spark 2.0 used / Multiclass classification model evaluation using Spark 2.0
  • multilabel classification model
    • evaluating, Spark 2.0 used / Multilabel classification model evaluation using Spark 2.0
  • multilabel metrics
    • reference / There's more...
  • multivariate statistical summary
    • reference / See also, See also

N

  • Naive Bayes
    • using / Theory on Classification
    • working / Naive Bayes in practice
  • netcat
    • reference / TCP stream
  • Net function / Artificial neural networks
  • Neural Net (NN) / How it works...
  • New GaussianMixture() parameter / New GaussianMixture()
  • nodes / Spark graph processing

O

  • OneHotEncoder / OneHotEncoder
  • OOM (Out of Memory) messages
    • avoiding / Memory
  • optimization techniques
    • about / Optimization techniques
    • data serialization / Data serialization
    • memory tuning / Memory tuning
  • Out Of Memory (OOM) / Common mistakes in Spark app development

P

  • paired key-value RDDs
    • used, for join transformation / Join transformation with paired key-value RDDs
    • used, for reducing transformation / Reduce and grouping transformation with paired key-value RDDs
    • used, for grouping transformation / Reduce and grouping transformation with paired key-value RDDs
  • parameters / The concept of pipelines
  • partitions / RDDs - what started it all...
  • pattern matching / There's more...
  • performance / Performance
  • performance-related problems, Spark
    • reference / Cloud
  • PIC (Power Iteration Clustering) / How it works...
  • Pima Diabetes data
    • downloading, for supervised classification / Downloading Pima Diabetes data for supervised classification
  • pipelines / What does the new API look like?, The concept of pipelines, Pipelines
  • Platform as a Service (PaaS) / Cloud-based deployments
  • PMMLExportable API
    • reference / See also
  • PowerIterationClustering() constructor
    • reference / See also
  • Power Iteration Clustering (PIC)
    • used, for classifying graph vertices / Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
  • PowerIterationClusteringModel() constructor
    • reference / See also
  • practical machine learning, with Spark
    • Scala, using / There's more...
  • Predictive Model Markup Language (PMML)
    • using / New model export and PMML markup in Spark 2.0
  • Priority Queue / Example - connection to a MQTT message broker
  • processing time
    • versus event time / How Apache Spark improves windowing

Q

  • Quantcast / Hadoop Distributed File System
  • quasiquotes / There's more...
  • queueStream
    • used, for debugging / Streaming data and debugging with queueStream
    • reference / How it works...

R

  • R
    • reference / How it works...
  • RandomForestClassifier / RandomForestClassifier
  • randomSplit()
    • reference / See also
  • RankingMetrics API
    • documentation link / There's more...
  • RDDs
    • about / RDDs - what started it all...
    • JdbcRDD / RDDs - what started it all...
    • Vertex RDD / RDDs - what started it all...
    • HadoopRDD / RDDs - what started it all...
    • UnionRDD / RDDs - what started it all...
    • RandomRDD / RDDs - what started it all...
    • creating, internal data sources used / Creating RDDs with Spark 2.0 using internal data sources
    • creating, external data sources used / Creating RDDs with Spark 2.0 using external data sources
    • transforming, filter() API used / Transforming RDDs with Spark 2.0 using the filter() API
    • transforming, flatMap() API / Transforming RDDs with the super useful flatMap() API
    • transforming, with set operation APIs / Transforming RDDs with set operation APIs
    • transforming, with zip() API / Transforming RDDs with the zip() API
    • datasets, creating / Creating and using Datasets from RDDs and back again
    • datasets, using / Creating and using Datasets from RDDs and back again
    • versus DataFrame / Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
    • versus Dataset from text file / Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
    • documentation, reference / See also
  • real-life machine learning applications
    • ML pipelines, creating / ML pipelines for real-life machine learning applications
  • real-life Spark ML project
    • dump of Wikipedia, downloading / Downloading a complete dump of Wikipedia for a real-life Spark ML project
  • real-time machine learning
    • structured streaming / Structured streaming for near real-time machine learning
    • DataFrames, streaming / Streaming DataFrames for real-time machine learning
    • Datasets, streaming / Streaming Datasets for real-time machine learning
  • real-time on-line classifier
    • k-means streaming / Streaming KMeans for a real-time on-line classifier
  • real-time regression
    • linear regression, streaming / Streaming linear regression for a real-time regression
  • recommendation engines / Introduction
  • recommendation system
    • about / Introduction
    • movie data details, exploring / Exploring the movies data details for the recommendation system in Spark 2.0
    • ratings data details, exploring / Exploring the ratings data details for the recommendation system in Spark 2.0, There's more...
  • recovery / Errors and recovery
  • reduceByKey() method
    • used, for RDD transformation/aggregation / RDD transformation/aggregation with groupBy() and reduceByKey()
  • RegressionMetrics facility / Regression model evaluation using Spark 2.0
  • relationships / Spark graph processing
  • Resilient Distributed Datasets (RDDs) / Machine learning

S

  • sample ML code
    • running, from Spark / Running a sample ML code from Spark
  • sbt tool / The development environment
  • Scala / The development environment, Introduction, Scala
  • scalable recommendation engine
    • required data, setting up / Setting up the required data for a scalable recommendation engine in Spark 2.0
    • building, with collaborative filtering / Building a scalable recommendation engine using collaborative filtering in Spark 2.0, There's more...
  • Scala Breeze library
    • used, for creating graphics in Spark 2.0 / Using the Scala Breeze library to do graphics in Spark 2.0
  • Scala data structures
    • DataFrames, creating / Creating DataFrames from Scala data structures
  • Scala pattern matching
    • reference / There's more...
  • Scala quasiquotes
    • reference / There's more...
  • Scala Sequence
    • used, for working with Dataset API / Working with the Dataset API using a Scala Sequence
  • ScalaTest's assertions
    • reference / Testing Scala methods
  • Scala test guideline
    • reference / Testing Scala methods
  • scikit-learn
    • reference / There's more...
  • session-based windows / How streaming engines use windowing
  • set operation APIs
    • used, for transforming RDDs / Transforming RDDs with set operation APIs
  • Singular Value Decomposition (SVD)
    • reference / See also
  • skip-gram model with negative sampling (SGNS) / There's more...
  • sliding windows / How streaming engines use windowing
  • Software as a Service (SaaS) / Cloud-based deployments
  • software versions / Software versions and libraries used in this book
  • Spark
    • testing, in distributed environment / Testing in a distributed environment
    • reference / Software versions and libraries used in this book
    • sample ML code, running / Running a sample ML code from Spark
    • download link / There's more...
    • used, for normalizing data / Normalizing data with Spark
    • tools / Introduction
    • term frequency, doing / Doing term frequency with Spark - everything that counts
    • used, for displaying similar words / Displaying similar words with Spark using Word2Vec
  • Spark 1.6 streaming
    • reference / There's more...
  • Spark 2.0
    • access to SparkContext vis-a-vis SparkSession object, obtaining / Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0
    • regression model, evaluating / Regression model evaluation using Spark 2.0
    • used, for multiclass classification model evaluation / Multiclass classification model evaluation using Spark 2.0
    • used, for multilabel classification model evaluation / Multilabel classification model evaluation using Spark 2.0
    • Scala Breeze library, used for creating graphics / Using the Scala Breeze library to do graphics in Spark 2.0
    • KMeans classifying system, building / Building a KMeans classifying system in Spark 2.0
    • KMeans, bisecting / Bisecting KMeans, the new kid on the block in Spark 2.0
    • Latent Semantic Analysis, used for text analysis / Using Latent Semantic Analysis for text analytics with Spark 2.0
    • topic modeling, with Latent Dirichlet allocation / Topic modeling with Latent Dirichlet allocation in Spark 2.0
  • Spark 2.0 ML documentation
    • reference / See also
  • Spark 2.0 MLlib
    • documentation link / See also
  • Spark 2.0+
    • Spark cluster, accessing / Getting access to Spark cluster in Spark 2.0
  • Spark applications
    • visualizing, web UI used / Visualizing Spark application using web UI
    • running, observing / Observing the running and completed Spark jobs
    • completed Spark jobs, observing / Observing the running and completed Spark jobs
    • debugging, logs used / Debugging Spark applications using logs
    • testing / Testing Spark applications, Testing Spark applications
    • Scala methods, testing / Testing Scala methods
    • unit testing / Unit testing
    • testing, with Scala JUnit test / Method 1: Using Scala JUnit test
    • Scala code, testing with FunSuite / Method 2: Testing Scala code using FunSuite
    • Spark testing base / Method 3: Making life easier with Spark testing base
    • debugging / Debugging Spark applications
  • Spark cluster
    • accessing, in Spark 2.0+ / Getting access to Spark cluster in Spark 2.0
  • Spark cluster pre-Spark 2.0
    • access, obtaining / Getting access to Spark cluster pre-Spark 2.0
  • Spark configuration
    • about / Spark configuration
    • Spark properties / Spark properties
    • environment variables / Environmental variables
    • logging / Logging
  • SparkContext
    • documentation reference / See also
    • reference / See also
  • SparkContext vis-a-vis SparkSession object
    • access, obtaining / Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0
  • Spark graph processing / Spark graph processing
  • Spark jobs, monitoring
    • about / Monitoring Spark jobs
    • Spark web interface / Spark web interface
  • Spark machine learning / Spark machine learning
  • SparkML / Spark machine learning
  • Spark ML
    • LabeledPoint data structure / LabeledPoint data structure for Spark ML
  • SparkML API / What does the new API look like?
  • Spark MLlib
    • architecture / Architecture
    • development environment / The development environment
  • Spark ML sample codes
    • running / Configuring IntelliJ to work with Spark and run Spark ML sample codes
  • Spark program
    • graphics, adding / How to add graphics to your Spark program
  • SparkSession
    • reference / See also
  • SparkSQL
    • DataFrames, using / How to do it...
  • Spark SQL / Spark SQL
  • Spark Stream Context (SSC) / Overview
  • Spark streaming
    • about / Introduction
    • reference / There's more...
  • Spark Streaming / Spark Streaming
  • Spark testing base
    • reference / Method 3: Making life easier with Spark testing base
  • Spark web interface
    • about / Spark web interface
    • Jobs / Jobs
    • Stages / Stages
    • Storage / Storage
    • Environment / Environment
    • Executors / Executors
    • SQL / SQL
  • sparse matrix / An example - alternating least squares
  • SparseVector API
    • reference / See also
  • sparse vector representations / Feature engineering
  • specialized datasets
    • reference / See also
  • static rewrites / High-level operators are generated
  • stemming
    • reference / How it works...
  • streaming engines
    • windowing, using / How streaming engines use windowing
  • streaming regression
    • wine quality data, downloading / Downloading wine quality data for streaming regression
  • Streaming sources
    • about / Streaming sources
    • TCP stream / TCP stream
    • file streams / File streams
    • Flume / Flume
    • Kafka / Kafka
  • stream life cycle management / More on stream life cycle management
  • stream processing / Spark Streaming
  • string indexer / String indexer
  • structured streaming
    • for near real-time machine learning / Structured streaming for near real-time machine learning
    • reference / See also
  • supervised classification
    • Pima Diabetes data, downloading / Downloading Pima Diabetes data for supervised classification
  • Support Vector Machine (SVM)
    • about / Normalizing data with Spark

T

  • Tachyon / Hadoop Distributed File System
  • TCP stream / TCP stream
  • TDD (test-driven development) / Testing Scala methods
  • text analysis / Introduction
  • time-based windows / How streaming engines use windowing
  • TinkerPop
    • reference / Overview
  • topic modeling
    • with Latent Dirichlet allocation / Topic modeling with Latent Dirichlet allocation in Spark 2.0
  • transformers
    • about / The concept of pipelines, Transformers
    • string indexer / String indexer
    • OneHotEncoder / OneHotEncoder
    • VectorAssembler / VectorAssembler
  • transparent fault tolerance
    • achieving / How transparent fault tolerance and exactly-once delivery guarantee is achieved, Idempotent sinks prevent data duplication
  • tumbling windows / How streaming engines use windowing

U

  • UDF (user-defined function) / Memory usage and management
  • unit testing
    • Spark applications / Unit testing
  • unit vectors
    • reference / There's more...
  • unsupervised classification
    • Iris data, downloading / Downloading and understanding the famous Iris data for unsupervised classification
  • unsupervised learning / Introduction

V

  • VectorAssembler / VectorAssembler
  • vertices, graph
    • classification, with Power Iteration Clustering (PIC) / Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
  • Virtual Machine (VM) / Challenges of software testing in a distributed environment

W

  • wdivmm (weighted divide matrix multiplication) / High-level operators are generated
  • weighted local neighborhood / Neighborhood method
  • Wikipedia dump
    • downloading, for real-life Spark ML project / Downloading a complete dump of Wikipedia for a real-life Spark ML project
  • windowing / Windowing
  • Windows
    • Hadoop runtime, configuring / Configuring Hadoop runtime on Windows
  • wine quality data
    • downloading, for streaming / Downloading wine quality data for streaming regression
  • Within Set Sum of Squared Errors (WSSSE) / K-Means in practice, How to do it...
  • Word2Vec
    • used, for displaying similar words with Spark / Displaying similar words with Spark using Word2Vec
    • reference / There's more..., See also
  • World Wide Web (WWW) / Testing in a distributed environment
  • Write Ahead Log (WAL) / How transparent fault tolerance and exactly-once delivery guarantee is achieved

Z

  • ZeroMQ / Spark Streaming
  • zip() API
    • used, for transforming RDDs / Transforming RDDs with the zip() API