Distributed graph computation with GraphX
GraphX (https://spark.apache.org/graphx/) is a distributed graph processing library designed to work with Spark. Like the MLlib library we used in the previous chapter, GraphX provides a set of abstractions built on top of Spark's RDDs. By representing the vertices and edges of a graph as RDDs, GraphX can process very large graphs in a scalable way.
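To make the idea of storing a graph as two collections more concrete, here is a minimal sketch in plain Python. It is not GraphX code: ordinary lists stand in for the vertex and edge RDDs, and the toy data and the helper names `out_degrees` and `with_names` are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy data: plain lists stand in for distributed collections.
# Each vertex record is (vertex_id, attribute).
vertices = [(1, "alice"), (2, "bob"), (3, "carol")]
# Each edge record is (source_id, destination_id, attribute).
edges = [(1, 2, "follows"), (1, 3, "follows"), (2, 3, "follows")]


def out_degrees(edge_records):
    """Count outgoing edges per vertex by scanning only the edge collection."""
    return Counter(src for src, _dst, _attr in edge_records)


def with_names(vertex_records, degrees):
    """Join the per-vertex counts back onto the vertex attributes."""
    return {name: degrees.get(vid, 0) for vid, name in vertex_records}


degree_by_name = with_names(vertices, out_degrees(edges))
print(degree_by_name)
```

Because each collection is just a set of independent records, either one could in principle be split across machines, which is the property that lets an RDD-backed graph scale.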
We've seen in previous chapters how to process a large dataset using MapReduce and Hadoop. Hadoop is an example of a data-parallel system: the dataset is divided into groups that are processed in parallel. Spark is also a data-parallel system: RDDs are distributed across the cluster and processed in parallel.
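The data-parallel pattern described above can be sketched on a single machine. The following is an illustration under stated assumptions rather than how Hadoop or Spark is implemented: a thread pool stands in for a cluster, and the function names `split`, `process_group`, and `parallel_word_count` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, n_groups):
    """Divide records into at most n_groups contiguous groups."""
    if not records:
        return []
    size = -(-len(records) // n_groups)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def process_group(group):
    """Work done independently on one group: count the words it contains."""
    return sum(len(line.split()) for line in group)


def parallel_word_count(records, n_groups=2):
    """Process each group separately, then combine the partial results."""
    groups = split(records, n_groups)
    with ThreadPoolExecutor(max_workers=n_groups) as pool:
        partials = list(pool.map(process_group, groups))
    return sum(partials)


lines = ["the quick brown fox", "jumps over", "the lazy dog", "again"]
total = parallel_word_count(lines, n_groups=2)
print(total)
```

The point of the sketch is the shape of the computation: no group needs to know anything about any other group until the final combining step. Threads are used here only to show that structure; in CPython they do not give true CPU parallelism for this kind of work.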
Data-parallel systems are an appropriate way to scale data processing when your data closely resembles a table. Graphs, which may have complex internal structure, are not represented most efficiently as tables. Although graphs can be represented as edge lists, as...