Analyzing JSON input modeled as a graph
In this section, we will analyze a JSON Dataset modeled as a graph. We will apply GraphFrame functions from the previous sections and introduce some new ones.
For the hands-on exercises in this section, we use a Dataset containing Amazon product metadata: product information and reviews for 548,552 products. This Dataset can be downloaded from https://snap.stanford.edu/data/amazon-meta.html.
To simplify processing, the original Dataset was converted to a JSON file in which each line represents a complete record. Use the Java program (Preprocess.java) provided with this chapter for the conversion.
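To make the "one complete record per line" target format concrete, here is a minimal Scala sketch of such a conversion. It is not the chapter's Preprocess.java. It assumes a simplified raw layout of "key: value" lines with records separated by blank lines, and the object name JsonLinesSketch, its helper names, and the title field are hypothetical choices made only for illustration.

```scala
// Hypothetical sketch, NOT the chapter's Preprocess.java.
// Assumption: each raw record is a block of "key: value" lines,
// and records are separated by one or more blank lines.
object JsonLinesSketch {
  private val q = "\""

  // Escape backslashes and double quotes so the value is a valid JSON string.
  def escape(s: String): String =
    s.replace("\\", "\\\\").replace("\"", "\\\"")

  // Turn one block of "key: value" lines into a single-line JSON object.
  def recordToJson(block: String): String = {
    val fields = block.split("\n").toSeq
      .map(_.trim)
      .filter(_.contains(":"))
      .map { line =>
        // Split on the first colon only, so values may contain colons.
        val idx = line.indexOf(':')
        val key = line.substring(0, idx).trim
        val value = line.substring(idx + 1).trim
        // Emit Id as a JSON number and every other field as a JSON string.
        if (key == "Id") q + key + q + ":" + value
        else q + key + q + ":" + q + escape(value) + q
      }
    fields.mkString("{", ",", "}")
  }

  // Convert the whole raw text into one JSON string per record.
  def toJsonLines(raw: String): Seq[String] =
    raw.split("\n\\s*\n").toSeq
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(recordToJson)
}
```

Calling JsonLinesSketch.toJsonLines on the raw text and writing each returned string on its own line yields a file that spark.read.json can load directly, because by default it expects one JSON object per line. Nested elements such as the review data would need additional handling that this sketch omits.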
First, we create a DataFrame from the input file, and print out the schema and a few sample records. It is a complex schema with nested elements:
scala> val df1 = spark.read.json("file:///Users/aurobindosarkar/Downloads/input.json")
scala> df1.printSchema()
root
 |-- ASIN: string (nullable = true)
 |-- Id: long (nullable = true)
 |-- ReviewMetaData: struct...