Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
This recipe classifies the vertices of a graph according to their similarities, as defined by the edges between them. The algorithm is implemented with the GraphX library, which ships out of the box with Spark. Power Iteration Clustering is similar to other eigenvector/eigenvalue decomposition algorithms, but without the overhead of matrix decomposition. It is suitable when you have a large sparse matrix (for example, a graph represented as a sparse matrix).
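To make the idea of clustering without a matrix decomposition more concrete, the following is a minimal plain-Scala sketch of the power-iteration step on a tiny, made-up similarity matrix. It is not the Spark implementation: the object name PicSketch, the helpers embed and split, the six-vertex matrix, and the iteration count are all assumptions chosen for illustration, and a simple split at the mean stands in for the k-means step that a full PIC implementation would normally apply to the embedding. It also assumes every vertex has at least one edge, so no row sum is zero.

```scala
// Plain-Scala sketch of the power-iteration idea; no Spark required.
// All names here (PicSketch, embed, split) are hypothetical, for illustration only.
object PicSketch {

  // Symmetric similarity (affinity) matrix for six vertices:
  // vertices 0-2 are tightly linked, 3-5 are tightly linked,
  // and a single weak edge (0.05) joins vertices 2 and 3.
  val affinity: Array[Array[Double]] = Array(
    Array(0.0, 1.0, 1.0,  0.0,  0.0, 0.0),
    Array(1.0, 0.0, 1.0,  0.0,  0.0, 0.0),
    Array(1.0, 1.0, 0.0,  0.05, 0.0, 0.0),
    Array(0.0, 0.0, 0.05, 0.0,  0.5, 0.5),
    Array(0.0, 0.0, 0.0,  0.5,  0.0, 0.5),
    Array(0.0, 0.0, 0.0,  0.5,  0.5, 0.0)
  )

  // Run a few steps of v <- W v / ||W v||_1, where W is the
  // row-normalized affinity matrix. Assumes no row sums to zero.
  def embed(a: Array[Array[Double]], iterations: Int): Array[Double] = {
    val degree = a.map(_.sum)
    val w = a.zip(degree).map { case (row, d) => row.map(_ / d) }
    // Degree-based starting vector, scaled to sum to 1.
    val total = degree.sum
    var v = degree.map(_ / total)
    for (_ <- 1 to iterations) {
      val next = w.map(row => row.zip(v).map { case (x, y) => x * y }.sum)
      val norm = next.map(x => math.abs(x)).sum
      v = next.map(_ / norm)
    }
    v
  }

  // Simplification: split the one-dimensional embedding at its mean
  // instead of running k-means on it.
  def split(v: Array[Double]): Array[Int] = {
    val mean = v.sum / v.length
    v.map(x => if (x > mean) 1 else 0)
  }

  def main(args: Array[String]): Unit = {
    val labels = split(embed(affinity, iterations = 10))
    labels.zipWithIndex.foreach { case (c, id) => println(s"vertex $id -> cluster $c") }
  }
}
```

The iteration is stopped early on purpose: run long enough, the vector flattens toward a constant, whereas after a handful of steps the entries belonging to one tightly linked group sit close together and apart from the other group. Only matrix-vector products are needed, which is why the approach suits large sparse similarity matrices.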
Going forward, GraphFrames is intended to be the replacement for, or the proper interface to, the GraphX library (https://databricks.com/blog/2016/03/03/introducing-graphframes.html).
How to do it...
Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
Set up the package location where the program will reside:
package spark.ml.cookbook.chapter8
Import the necessary packages for the Spark context to get access to the cluster and Log4j...