Singular Value Decomposition (SVD) to reduce high-dimensionality in Spark
In this recipe, we will explore a dimensionality reduction method straight out of linear algebra, called SVD (Singular Value Decomposition). The key idea is to come up with a set of low-rank matrices (typically three) that approximate the original matrix with much less data, rather than working with the full M by N matrix.
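To make the three matrices concrete, the factorization can be written as follows. This is a sketch in standard textbook notation: M, N, and S follow the recipe, while the symbols A (the original matrix), U, V, and k (the number of singular values kept) are assumptions introduced here for illustration.

```latex
A_{M \times N} \;=\; U_{M \times M}\, S_{M \times N}\, V^{T}_{N \times N}
\;\approx\; U_{M \times k}\, S_{k \times k}\, V^{T}_{k \times N},
\qquad k \ll \min(M, N)
```

If S is stored as just its k diagonal entries, the truncated form needs roughly k(M + N + 1) numbers instead of the M times N numbers of the original matrix, which is where the saving comes from when k is small.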
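SVD is a simple linear algebra technique that factors the original data into low-rank matrices built from singular vectors and singular values (the eigenvectors, and the square roots of the eigenvalues, of the data matrix multiplied by its own transpose). These matrices capture most of the attributes (the original dimensions) in a much more compact low-rank system.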
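As a minimal sketch of this idea outside Spark, the following uses NumPy on a tiny, made-up ratings matrix. The data values, the helper names `truncated_svd` and `reconstruct`, and the choice of k = 2 are all illustrative assumptions and are not part of the recipe; the recipe itself works on distributed data in Spark.

```python
import numpy as np

# Hypothetical 4 users x 3 movies ratings matrix (made-up values).
A = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])


def truncated_svd(matrix, k):
    """Return the rank-k factors (U_k, s_k, Vt_k) of `matrix`."""
    # full_matrices=False gives the "thin" SVD; singular values in s
    # are returned in descending order.
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def reconstruct(U_k, s_k, Vt_k):
    """Multiply the factors back into an approximation of the original."""
    # Broadcasting scales column j of U_k by s_k[j], i.e. U_k @ diag(s_k).
    return (U_k * s_k) @ Vt_k


k = 2  # number of singular values ("concepts") to keep; an arbitrary choice
U_k, s_k, Vt_k = truncated_svd(A, k)
A_k = reconstruct(U_k, s_k, Vt_k)

# Frobenius-norm error of the rank-k approximation.
error = np.linalg.norm(A - A_k)

# All singular values, used to see how much was discarded.
full_s = np.linalg.svd(A, compute_uv=False)

print("kept singular values:", s_k)
print("discarded singular values:", full_s[k:])
print("approximation error:", error)
```

The approximation error equals the square root of the sum of the squared discarded singular values, so inspecting the singular values is one way to judge how many are worth keeping.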
The following figure depicts how SVD can be used to reduce dimensions, with the S matrix then used to keep or eliminate higher-level concepts derived from the original data (that is, to produce a low-rank matrix with fewer columns/features than the original):
How to do it...
- We will use the movie rating data for the SVD analysis. The MovieLens 1M dataset contains...