Index
A
- Apache Spark
- about / Introduction
- tools / Introduction
- URL / Getting Apache Spark
- obtaining / How to do it...
- apply method / Constructing a vector from values
- arbitrary transformations
- URL / How to do it...
- ATLAS
- URL / The org.scalanlp.breeze-natives package
- Avro data model
- using, in Parquet / Using the Avro data model in Parquet, How to do it…
- URL / Using the Avro data model in Parquet
- creating / Creation of the Avro model
- schema_complex, URL / Creation of the Avro model
- schema_primitive, URL / Creation of the Avro model
- Avro objects, generating with sbt-avro plugin / Generation of Avro objects using the sbt-avro plugin
- RDD of generated object, constructing from Students.csv / Constructing an RDD of our generated object from Students.csv
B
- binary classification
- LogisticRegression, using with Pipeline API / Binary classification using LogisticRegression with Pipeline API
- binary classification, with LogisticRegression and SVM
- about / Binary classification using LogisticRegression and SVM
- Bokeh-Scala
- URL / Introduction
- used, for creating scatter plots / Creating scatter plots with Bokeh-Scala, How to do it...
- glyph / How to do it...
- plot / How to do it...
- document / How to do it...
- used, for creating time series MultiPlot / Creating a time series MultiPlot with Bokeh-Scala, How to do it...
- Breeze
- URL / Getting Breeze – the linear algebra library
- about / Getting Breeze – the linear algebra library
- breeze dependencies / How to do it...
- breeze-native dependencies / How to do it...
- obtaining / How to do it...
- org.scalanlp.breeze dependency / The org.scalanlp.breeze dependency
- org.scalanlp.breeze-natives package / The org.scalanlp.breeze-natives package
- org.scalanlp.breeze-natives package, URL / The org.scalanlp.breeze-natives package
- breeze-viz / Creating scatter plots with Bokeh-Scala
C
- chill library
- reference link / How to do it...
- classes
- more than 22 features, loading / Loading more than 22 features into classes, How to do it..., How it works...
- clustering
- about / Clustering using K-means
- K-means, using / Clustering using K-means, How to do it...
- continuous values
- predicting, with linear regression / Predicting continuous values using linear regression, How to do it...
- CSV
- DataFrame, creating from / Creating a DataFrame from CSV, How to do it..., There's more…
- CSV files
- reading / Reading and writing CSV files, How it works...
- writing / Reading and writing CSV files, How it works...
- csvread function / How it works...
D
- data
- preparing, in Dataframes / Preparing data in Dataframes, How to do it...
- pulling, from ElasticSearch / How to do it...
- DataFrame
- creating, from CSV / Creating a DataFrame from CSV, How to do it..., There's more…
- URL / Creating a DataFrame from CSV, Manipulating DataFrames, Creating a DataFrame from Scala case classes
- manipulating / Manipulating DataFrames, How to do it...
- schema, printing / Printing the schema of the DataFrame
- data, sampling / Sampling the data in the DataFrame
- columns, selecting / Selecting DataFrame columns
- data by condition, filtering / Filtering data by condition
- data, sorting in frame / Sorting data in the frame
- columns, renaming / Renaming columns
- treating, as relational table / Treating the DataFrame as a relational table
- two DataFrames, joining / Joining two DataFrames
- inner join / Inner join
- right outer join / Right outer join
- left outer join / Left outer join
- saving, as file / Saving the DataFrame as a file
- creating, from Scala case classes / Creating a DataFrame from Scala case classes, How to do it..., How it works...
- JSON, loading / Loading JSON into DataFrames, How to do it…
- JSON file, reading with SQLContext.jsonFile / Reading a JSON file using SQLContext.jsonFile
- text file, converting to JSON RDD / Reading a text file and converting it to JSON RDD
- text file, reading / Reading a text file and converting it to JSON RDD
- schema, explicitly specifying / Explicitly specifying your schema, There's more…
- data, preparing / Preparing data in Dataframes, How to do it...
- Directed Acyclic Graph (DAG) / Submitting jobs to the Spark cluster (local)
- Dow Jones Index Data Set
- URL / Creating a time series MultiPlot with Bokeh-Scala
- Driver program / There's more…
- DStreams
- about / Using Spark Streaming to subscribe to a Twitter stream
E
- EC2
- Spark Standalone cluster, running / Running the Spark Standalone cluster on EC2
- Elasticsearch
- URL, for downloading installable / How to do it...
- ElasticSearch
- URL / Using Spark Streaming to subscribe to a Twitter stream
- data, pulling from / How to do it...
- ETL tool
- Spark, using as / Using Spark as an ETL tool, How to do it...
F
- feature reduction
- PCA, using / Feature reduction using principal component analysis
G
- gradient descent
- about / Gradient descent
- Graphviz
- URL / Transitive dependency stated explicitly in the SBT dependency
- GraphX
- about / Using GraphX to analyze Twitter data
- used, for analyzing Twitter data / How to do it...
H
- Hadoop cluster
- URL / Installing the Hadoop cluster
- Hadoop Distributed File System (HDFS)
- about / Loading JSON into DataFrames
- URL / Loading JSON into DataFrames
- HDFS
- data, pushing / Pushing data into HDFS
- head function / Using the tools to inspect the Parquet file
- Hive table
- URL / Save it as a Parquet file
I
- instance-type
- URL / Running the launch script
- iris data
- URL / How to do it...
J
- Joda-Time API / Preparing our data
K
- K-means
- used, for clustering / Clustering using K-means
- about / Clustering using K-means, How to do it...
- KMeans.RANDOM / KMeans.RANDOM
- KMeans.PARALLEL / KMeans.PARALLEL
- max iterations / Max iterations
- epsilon / Epsilon
- data, importing / Importing the data and converting it into a vector
- data, converting into vector / Importing the data and converting it into a vector
- data, feature scaling / Feature scaling the data
- number of clusters, deriving / Deriving the number of clusters
- model, constructing / Constructing the model
- model, evaluating / Evaluating the model
- Kafka
- setting up / How to do it...
- Kafka server
- starting / How to do it...
- Kafka topic
- creating / How to do it...
- Kafka version 0.8.2.1, for Scala 2.10
- URL / How to do it...
- KMeans.PARALLEL
- about / KMeans.PARALLEL
- K-means++ / K-means++
- K-means|| / K-means||
- Kryo / Saving RDD[StudentAvro] in a Parquet file
- KryoSerializer
- about / Using Spark as an ETL tool
- used, for publishing data to Kafka / How to do it...
L
- legends property / Adding a legend to the plot
- Lempel-Ziv-Oberhumer (LZO) / Enable compression for the Parquet file
- linear regression
- used, for predicting continuous values / Predicting continuous values using linear regression, How to do it...
- data, importing / Importing the data
- each instance, converting into LabeledPoint / Converting each instance into a LabeledPoint
- training data, preparing / Preparing the training and test data
- test data, preparing / Preparing the training and test data
- features, scaling / Scaling the features
- model, training / Training the model
- test data, predicting against / Predicting against test data
- model, evaluating / Evaluating the model
- parameters, regularizing / Regularizing the parameters
- mini batching / Mini batching
- LogisticRegression
- used, for binary classification with Pipeline API / Binary classification using LogisticRegression with Pipeline API
M
- matrices
- working with / Working with matrices, How to do it...
- creating / Creating matrices
- creating, from values / Creating a matrix from values
- zero matrix, creating / Creating a zero matrix
- creating, out of function / Creating a matrix out of a function
- identity matrix, creating / Creating an identity matrix
- creating, from random numbers / Creating a matrix from random numbers
- creating, from Scala collection / Creating from a Scala collection
- appending / Appending and conversion
- concatenating / Concatenating matrices – vertically
- concatenating, vertcat function / Concatenating matrices – vertically
- concatenating, horzcat function / Concatenating matrices – horizontally
- data manipulation operations / Data manipulation operations
- basic statistics, computing / Computing basic statistics
- mean and variance / Mean and variance
- standard deviation / Standard deviation
- working / How it works...
- with randomly distributed values / Vectors and matrices with randomly distributed values, How it works...
- matrix
- column vectors, obtaining / Getting column vectors out of the matrix
- row vectors, obtaining / Getting row vectors out of the matrix
- inside values, obtaining / Getting values inside the matrix
- inverse, obtaining / Getting the inverse and transpose of a matrix
- transpose, obtaining / Getting the inverse and transpose of a matrix
- largest value, finding / Finding the largest value in a matrix
- sum, finding / Finding the sum, square root and log of all the values in the matrix
- square root, finding / Finding the sum, square root and log of all the values in the matrix
- log of all values, finding / Finding the sum, square root and log of all the values in the matrix
- sqrt function / Finding the sum, square root and log of all the values in the matrix
- log function / Finding the sum, square root and log of all the values in the matrix
- eigenvectors, calculating / Calculating the eigenvectors and eigenvalues of a matrix
- eigenvalues, calculating / Calculating the eigenvectors and eigenvalues of a matrix
- with uniformly random values, creating / Creating a matrix with uniformly random values
- with normally distributed random values, creating / Creating a matrix with normally distributed random values
- with random values with Poisson distribution, creating / Creating a matrix with random values that has a Poisson distribution
- matrix arithmetic
- about / Matrix arithmetic
- addition / Addition
- multiplication / Multiplication
- matrix of Int
- converting, into matrix of Double / Converting a matrix of Int to a matrix of Double
- Mesos
- Spark job, running / How to do it...
- installing / Installing Mesos
- URL / Installing Mesos
- master and slave, starting / Starting the Mesos master and slave
- Spark binary package, uploading to HDFS / Uploading the Spark binary package and the dataset to HDFS
- dataset, uploading to HDFS / Uploading the Spark binary package and the dataset to HDFS
- micro-batching
- about / Using Spark Streaming to subscribe to a Twitter stream
N
- NumPy
- URL / Getting Breeze – the linear algebra library
O
- OpenBLAS
- URL / The org.scalanlp.breeze-natives package
P
- PairRDD
- URL / Saving RDD[StudentAvro] in a Parquet file
- Parquet
- URL / Storing data as Parquet files
- parquet-tools, URL / Install Parquet tools
- Avro data model, using / Using the Avro data model in Parquet, How to do it…
- Parquet-MR project
- URL / Storing data as Parquet files
- Parquet files
- data, storing as / Storing data as Parquet files, How to do it…, Load a simple CSV file, convert it to case classes, and create a DataFrame from it, Save it as a Parquet file
- inspecting, with tools / Using the tools to inspect the Parquet file
- Snappy compression of data, enabling / Enable compression for the Parquet file
- RDD[StudentAvro], saving / Saving RDD[StudentAvro] in a Parquet file
- file back, reading for verification / Reading the file back for verification
- Parquet tools
- installing / Install Parquet tools
- using, for verification / Using Parquet tools for verification
- PCA
- used, for feature reduction / Feature reduction using principal component analysis, How to do it...
- about / Feature reduction using principal component analysis
- dimensionality reduction, of data for supervised learning / Dimensionality reduction of data for supervised learning
- training data, mean-normalizing / Mean-normalizing the training data, Mean-normalizing the training data
- principal components, extracting / Extracting the principal components, Extracting the principal components
- labeled data, preparing / Preparing the labeled data
- test data, preparing / Preparing the test data
- metrics, classifying / Classify and evaluate the metrics
- metrics, evaluating / Classify and evaluate the metrics, Evaluating the metrics
- data, dimensionality reduction / Dimensionality reduction of data for unsupervised learning
- number of components / Arriving at the number of components
- pem key
- URL / Creating the AccessKey and pem file
- Pipeline API, used for solving binary classification
- data, importing as test / Importing and splitting data as test and training sets
- data, importing as training sets / Importing and splitting data as test and training sets
- data, splitting as training sets / Importing and splitting data as test and training sets
- data, splitting as test / Importing and splitting data as test and training sets
- participants, constructing / Construct the participants of the Pipeline
- pipeline, preparing / Preparing a pipeline and training a model
- model, training / Preparing a pipeline and training a model
- test data, predicting against / Predicting against test data
- model, evaluating without cross-validation / Evaluating a model without cross-validation
- parameters for cross-validation, constructing / Constructing parameters for cross-validation
- cross-validator, constructing / Constructing cross-validator and fit the best model
- model, evaluating with cross-validation / Evaluating the model with cross-validation
- Pipeline API, used for solving binary classification problem
- about / Binary classification using LogisticRegression with Pipeline API
- prerequisite, for running ElasticSearch instance on machine
- Elasticsearch, running / How to do it...
- Twitter app, creating / How to do it...
- Spark Streaming, adding / How to do it...
- Twitter dependency, adding / How to do it...
- Twitter stream, creating / How to do it...
- stream, saving to ElasticSearch / How to do it...
- Principal Component Analysis (PCA) / Gradient descent
- Privacy Enhanced Mail (PEM) / How to do it...
- Product
- API docs, URL / How to do it...
- pseudo-clustered mode
- HDFS, running / Running HDFS on Pseudo-clustered mode
- URL / Running HDFS on Pseudo-clustered mode
R
- RDBMS
- loading / Loading from RDBMS, How to do it…
- reduceByKey function / How to do it...
- Resilient Distributed Dataset (RDD) / How it works...
- RowGroups / Storing data as Parquet files
S
- save method / Save it as a Parquet file
- sbt-avro plugin
- URL / Generation of Avro objects using the sbt-avro plugin
- sbt-dependency-graph plugin
- URL / How to do it...
- SBT assembly plugin
- URL / How to do it...
- sbteclipse plugin
- URL / How to do it...
- Scala bindings
- URL / Introduction
- Scala Build Tool (SBT) / Getting Breeze – the linear algebra library
- Scala case classes
- DataFrame, creating from / Creating a DataFrame from Scala case classes, How to do it..., How it works...
- scatter plots, creating with Bokeh-Scala
- about / Creating scatter plots with Bokeh-Scala, How to do it...
- data, preparing / Preparing our data
- Plot and Document objects, creating / Creating Plot and Document objects
- marker object, creating / Creating a marker object
- x and y axes data ranges, setting for plot / Setting the X and Y axes' data range for the plot
- x and y axes, drawing / Drawing the x and the y axes
- flower species with varying colors, viewing / Viewing flower species with varying colors
- grid lines, adding / Adding grid lines
- legend, adding to plot / Adding a legend to the plot
- URL / Adding a legend to the plot
- Sense plugin
- URL / How to do it...
- Snappy
- URL / Enable compression for the Parquet file
- Snappy compression / Enable compression for the Parquet file
- source build tool (SBT) / Getting Apache Spark
- Spark
- downloading / Downloading Spark
- URL, for download / Downloading Spark
- using, as ETL tool / Using Spark as an ETL tool, How to do it...
- spark.driver.extraClassPath property
- URL / Building the Uber JAR
- Spark 1.4
- URL / Submitting jobs to the Spark cluster (local)
- Spark application
- building / Introduction
- submitting, on cluster / Submitting the Spark application on the cluster
- Spark cluster
- jobs, submitting to / Submitting jobs to the Spark cluster (local)
- Spark job
- submitting, to Spark cluster / Submitting jobs to the Spark cluster (local)
- running, on Mesos / Running the Spark Job on Mesos (local), Running the job
- running, on YARN / Running the Spark Job on YARN (local), How to do it...
- running, in yarn-client mode / Running a Spark job in yarn-client mode
- running, in yarn-cluster mode / Running Spark job in yarn-cluster mode
- Spark job, installing on YARN
- about / Running the Spark Job on YARN (local), How to do it...
- Hadoop cluster, installing / Installing the Hadoop cluster
- HDFS, starting / Starting HDFS and YARN
- Spark assembly, pushing to HDFS / Pushing Spark assembly and dataset to HDFS
- dataset, pushing to HDFS / Pushing Spark assembly and dataset to HDFS
- Spark master and slave
- running / Running the Spark master and slave locally
- Spark Standalone cluster
- running, on EC2 / Running the Spark Standalone cluster on EC2
- AccessKey, creating / Creating the AccessKey and pem file
- pem file, creating / Creating the AccessKey and pem file
- environment variables, setting / Setting the environment variables
- launch script, running / Running the launch script
- installation, verifying / Verifying installation
- changes, making to code / Making changes to the code
- data, transferring / Transferring the data and job files
- job files, transferring / Transferring the data and job files
- dataset, loading into HDFS / Loading the dataset into HDFS
- job, running / Running the job
- destroying / Destroying the cluster
- Spark Streaming
- used, for subscribing to Twitter stream / Using Spark Streaming to subscribe to a Twitter stream
- Stochastic Gradient Descent (SGD) / Gradient descent
- StreamingLogisticRegression, used for classifying Twitter stream
- about / Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream
- subscription, to Kafka stream / How to do it...
- classification model, training / How to do it...
- live Twitter stream, classifying / How to do it...
- Student dataset
- URL / Loading more than 22 features into classes
- supervised learning / Supervised and unsupervised learning
- Support Vector Machine (SVM)
- about / Binary classification using LogisticRegression and SVM
T
- time series MultiPlot, creating with Bokeh-Scala
- about / Creating a time series MultiPlot with Bokeh-Scala, How to do it...
- data, preparing / Preparing our data
- Plot, creating / Creating a plot
- line joining to all data points, creating / Creating a line that joins all the data points
- x and y axes data ranges for plot, setting / Setting the x and y axes' data range for the plot
- axes, drawing / Drawing the axes and the grids
- grids, drawing / Drawing the axes and the grids
- tools, adding / Adding tools
- legend, adding to plot / Adding a legend to the plot
- multiple plots, creating in document / Multiple plots in the document
- URL / Multiple plots in the document
- toDF() function / How to do it...
- twitter-chill project
- URL / Saving RDD[StudentAvro] in a Parquet file
- twitter4j library
- URL / How to do it...
- Twitter app
- URL / How to do it...
- Twitter data
- analyzing, with GraphX / How to do it...
- Twitter stream
- subscribing to / Using Spark Streaming to subscribe to a Twitter stream
U
- Uber JAR
- building / Building the Uber JAR, How to do it...
- transitive dependency stated explicitly, in SBT dependency / Transitive dependency stated explicitly in the SBT dependency
- different libraries dependency, on same library / Two different libraries depend on the same external library
- unsupervised learning / Supervised and unsupervised learning
V
- vector concatenation
- about / Concatenating two vectors
- vector of Int, converting to vector of Double / Converting a vector of Int to a vector of Double
- basic statistics, computing / Computing basic statistics
- mean, calculating / Mean and variance
- variance, calculating / Mean and variance
- vectors
- working with / Working with vectors, Getting ready
- creating / Creating vectors
- constructing, from values / Constructing a vector from values
- zero vector, creating / Creating a zero vector
- creating, out of function / Creating a vector out of a function
- vector of linearly spaced values, creating / Creating a vector of linearly spaced values
- vector with values, creating in specific range / Creating a vector with values in a specific range
- entire vector with single value, creating / Creating an entire vector with a single value
- sub-vector, slicing from bigger vector / Slicing a sub-vector from a bigger vector
- Breeze vector, creating from Scala vector / Creating a Breeze Vector from a Scala Vector
- arithmetic / Vector arithmetic
- scalar operations / Scalar operations
- dot product of two vectors, calculating / Calculating the dot product of two vectors
- creating, by adding two vectors / Creating a new vector by adding two vectors together
- appending / Appending vectors and converting a vector of one type to another
- converting from one type to another / Appending vectors and converting a vector of one type to another
- concatenating / Concatenating two vectors
- standard deviation / Standard deviation
- largest value, finding / Find the largest value in a vector
- sum, finding / Finding the sum, square root and log of all the values in the vector
- log, finding / Finding the sum, square root and log of all the values in the vector
- square root, finding / Finding the sum, square root and log of all the values in the vector
- sqrt function / Finding the sum, square root and log of all the values in the vector
- log function / Finding the sum, square root and log of all the values in the vector
- with randomly distributed values / Vectors and matrices with randomly distributed values, How it works...
- with uniformly distributed random values, creating / Creating vectors with uniformly distributed random values
- with normally distributed random values, creating / Creating vectors with normally distributed random values
- with random values with Poisson distribution, creating / Creating vectors with random values that have a Poisson distribution
W
- Worker nodes / There's more…
Y
- YARN
- Spark job, running / Running the Spark Job on YARN (local), How to do it...
Z
- Zeppelin
- used, for visualizing / Visualizing using Zeppelin
- URL / Installing Zeppelin
- installing / Installing Zeppelin
- server, customizing / Customizing Zeppelin's server and websocket port
- websocket port, customizing / Customizing Zeppelin's server and websocket port
- data, visualizing on HDFS / Visualizing data on HDFS – parameterizing inputs
- inputs, parameterizing / Visualizing data on HDFS – parameterizing inputs
- custom functions, running / Running custom functions
- external dependencies, adding / Adding external dependencies to Zeppelin
- external Spark cluster, pointing to / Pointing to an external Spark cluster
- Zookeeper
- starting / How to do it...
- URL / How to do it...