Just Enough Linear Algebra for Machine Learning with Spark

In this chapter, we will cover the following recipes:

Package imports and initial setup for vectors and matrices
Creating DenseVector and setup with Spark 2.0
Creating SparseVector and setup with Spark 2.0
Creating DenseMatrix and setup with Spark 2.0
Using sparse local matrices with Spark 2.0
Performing vector arithmetic using Spark 2.0
Performing matrix arithmetic with Spark 2.0
Distributed matrices in Spark 2.0 ML library
Exploring RowMatrix in Spark 2.0
Exploring distributed IndexedRowMatrix in Spark 2.0
Exploring distributed CoordinateMatrix in Spark 2.0
Exploring distributed BlockMatrix in Spark 2.0

Introduction

Linear algebra is the cornerstone of machine learning (ML) and mathematical programming (MP). When dealing with Spark's machine learning library, one must understand that the Vector/Matrix structures provided by Scala (imported by default) are different from the Spark ML, MLlib Vector, Matrix facilities provided by Spark. The latter, powered by RDDs, is the desired data structure if you are going to use Spark (that is, parallelism) out of the box for large-scale matrix/vector computation (for example, SVD implementation alternatives with more numerical accuracy, desired in some cases for derivatives pricing and risk analytics). The Scala Vector/Matrix libraries provide a rich set of linear algebra operations such as dot product, additions, and so on, that still have their own place in an ML pipeline. In summary, the key difference between using Scala Breeze and Spark or Spark ML is that the Spark facility is backed by RDDs, which allows for simultaneous distributed, concurrent computing, and resiliency...
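As a rough illustration of how the two libraries can sit side by side, the following sketch wraps the values of two Spark MLlib vectors in Breeze vectors to compute a dot product and a sum, and then wraps the result back into a Spark vector. The object name BreezeInteropSketch and the sample values are assumptions made for this example only; they do not come from the recipes.

```scala
import breeze.linalg.{DenseVector => BreezeVector}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper object, used only for illustration.
object BreezeInteropSketch {
  // Two local Spark MLlib vectors (sample values, not from the book).
  val a: Vector = Vectors.dense(1.0, 2.0, 3.0)
  val b: Vector = Vectors.dense(4.0, 5.0, 6.0)

  // Wrap the underlying values in Breeze vectors to get arithmetic operators.
  val ba = new BreezeVector(a.toArray)
  val bb = new BreezeVector(b.toArray)

  // Dot product computed by Breeze: 1*4 + 2*5 + 3*6 = 32.0
  val dot: Double = ba dot bb

  // Element-wise sum, converted back to a Spark MLlib vector: [5.0, 7.0, 9.0]
  val sum: Vector = Vectors.dense((ba + bb).toArray)
}
```

Everything here runs locally on the driver; no SparkSession is needed for local vectors.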

Package imports and initial setup for vectors and matrices

Before we can program in Spark or use vector and matrix artifacts, we need to first import the right packages and then set up SparkSession so we can gain access to the cluster handle.

In this short recipe, we highlight a comprehensive set of packages that covers most of the linear algebra operations in Spark. The individual recipes that follow will include the exact subset required for the specific program.

How to do it...

Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
Set up the package location where the program will reside:

package spark.ml.cookbook.chapter2

Import the necessary packages for vector and matrix manipulation:

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.sql.{SparkSession...

Creating DenseVector and setup with Spark 2.0

In this recipe, we explore DenseVectors using the Spark 2.0 machine learning library.

Spark provides two distinct types of vector facilities (dense and sparse) for storing and manipulating feature vectors that are going to be used in machine learning or optimization algorithms.

How to do it...

In this section, we examine DenseVector examples that you would most likely use for implementing/augmenting existing machine learning programs. These examples also help to better understand Spark ML or MLlib source code and the underlying implementation (for example, Singular Value Decomposition).
Here we look at creating an ML vector feature (with independent variables) from arrays, which is a common use case. In this case, we have three almost fully populated Scala arrays corresponding to customer and product feature sets. We convert these arrays to the corresponding DenseVectors in Scala:

val CustomerFeatures1: Array[Double] = Array(1,3,5,7,9,1,3,2,4,5,6,1,2,5,3,7,4,3,4,1)
val CustomerFeatures2...
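The array-to-vector conversion described above can be sketched as follows. This is a minimal example under stated assumptions: the object name DenseVectorSketch and the short feature array are hypothetical stand-ins, not the recipe's full customer data.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper object, used only for illustration.
object DenseVectorSketch {
  // Sample feature values (an assumption, not the book's data set).
  val features: Array[Double] = Array(1.0, 3.0, 5.0, 7.0, 9.0)

  // Vectors.dense wraps a Scala array in a local DenseVector.
  val denseVec: Vector = Vectors.dense(features)

  // The varargs form builds the same kind of vector from literal values.
  val denseVec2: Vector = Vectors.dense(2.0, 4.0, 6.0)
}
```

Elements are then read by position, for example DenseVectorSketch.denseVec(2) returns 5.0, and size reports the vector length.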

Creating SparseVector and setup with Spark 2.0

In this recipe, we examine several types of SparseVector creation. As the length of the vector increases (millions) and the density remains low (few non-zero members), the sparse representation becomes more and more advantageous over the DenseVector.
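To make the storage trade-off concrete, here is a minimal sketch of a vector of length one million that stores only three values. The object name SparseVectorSketch, the indices, and the values are assumptions chosen for illustration.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper object, used only for illustration.
object SparseVectorSketch {
  // Logical length of the vector; only the non-zero entries are stored.
  val size: Int = 1000000

  // Form 1: parallel arrays of (sorted) indices and their values.
  val sparseVec: Vector =
    Vectors.sparse(size, Array(3, 500, 999999), Array(1.5, 2.0, 7.0))

  // Form 2: a sequence of (index, value) pairs describing the same vector.
  val sameVec: Vector =
    Vectors.sparse(size, Seq((3, 1.5), (500, 2.0), (999999, 7.0)))

  // Number of explicitly stored entries, as opposed to the logical length.
  val stored: Int = sparseVec.numActives
}
```

Any position that is not listed reads as 0.0, so SparseVectorSketch.sparseVec(4) returns 0.0 while only three entries occupy memory.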

How to do it...

Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
Import the necessary packages for vector and matrix manipulation:

import org.apache.spark.sql.{SparkSession}
import org.apache.spark.mllib.linalg._
import breeze.linalg.{DenseVector => BreezeVector}
import Array._
import org.apache.spark.mllib.linalg.SparseVector

Set up the Spark context and application parameters so Spark can run. See the first recipe in this chapter for more details and variations:

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("myVectorMatrix")
  .config("spark.sql.warehouse.dir", ".")
  .getOrCreate()

Here we look at creating an ML SparseVector that corresponds...

Creating DenseMatrix and setup with Spark 2.0

In this recipe, we explore matrix creation examples that you most likely would need in your Scala programming and while reading the source code for many of the open source libraries for machine learning.

Spark provides two distinct types of local matrix facilities (dense and sparse) for storage and manipulation of data at a local level. For simplicity, one way to think of a matrix is to visualize it as columns of Vectors.
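The "columns of Vectors" view maps directly onto how a local dense matrix is laid out: values are supplied in column-major order. The following minimal sketch builds a 3 x 2 matrix; the object name DenseMatrixSketch and the values are assumptions made for this example.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Hypothetical helper object, used only for illustration.
object DenseMatrixSketch {
  // 3 rows x 2 columns, values given column by column:
  //   column 0 = (1.0, 2.0, 3.0), column 1 = (4.0, 5.0, 6.0)
  val m: Matrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

  // The transpose is a 2 x 3 matrix over the same values.
  val t: Matrix = m.transpose
}
```

Cells are addressed as (row, column), so DenseMatrixSketch.m(0, 1) is the first element of the second column, 4.0.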

Getting ready

The key point to remember here is that the recipe covers local matrices stored on one machine. We will use another recipe, Distributed matrices in Spark 2.0 ML library, covered in this chapter, for storing and manipulating distributed matrices.

How to do it...

Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
Import the necessary packages for vector and matrix manipulation:

import org.apache.spark.sql.{SparkSession}
import org.apache.spark.mllib.linalg._
import breeze...

Using sparse local matrices with Spark 2.0

In this recipe, we concentrate on SparseMatrix creation. In the previous recipe, we saw how a local dense matrix is declared and stored. A good number of machine learning problem domains can be represented as a set of features and labels within the matrix. In large-scale machine learning problems (for example, progression of a disease through large population centers, security fraud, political movement modeling, and so on), a good portion of the cells will be 0 or null (for example, the current number of people with a given disease versus the healthy population).

To help with storage and efficient operation in real time, sparse local matrices specialize in storing only the populated cells as a list of values plus an index, which leads to faster loading and real-time operations.
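The "list plus an index" layout can be sketched with Matrices.sparse, which takes the matrix in compressed sparse column (CSC) form: column pointers, row indices, and the non-zero values. The object name SparseMatrixSketch and the 3 x 3 sample matrix are assumptions chosen for illustration.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Hypothetical helper object, used only for illustration.
object SparseMatrixSketch {
  // Target matrix (3 x 3) with three non-zero cells:
  //   9.0  0.0  0.0
  //   0.0  0.0  8.0
  //   0.0  6.0  0.0
  //
  // values     lists the non-zeros column by column: 9.0, 6.0, 8.0
  // rowIndices gives the row of each listed value:   0,   2,   1
  // colPtrs    marks where each column starts in values, plus a final end marker.
  val sm: Matrix = Matrices.sparse(
    3, 3,
    Array(0, 1, 2, 3),   // colPtrs
    Array(0, 2, 1),      // rowIndices
    Array(9.0, 6.0, 8.0) // values
  )

  // Number of explicitly stored cells (3 of the 9 logical cells).
  val stored: Int = sm.numActives
}
```

Cells that are not listed read as 0.0, so SparseMatrixSketch.sm(1, 1) returns 0.0 without any storage being spent on it.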

How to do it...

Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
Import the necessary packages for vector and matrix manipulation: