Using Spark with JSON data
JSON is a simple and flexible format used extensively for data interchange in web services. Spark supports JSON well: you do not need to define a schema for the data, because Spark infers it automatically. In addition, Spark greatly simplifies the query syntax required to access fields in complex JSON data structures. We will present detailed examples of JSON data in Chapter 12, Spark SQL in Large-Scale Application Architectures.
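To make the point about simplified query syntax concrete, here is a minimal sketch that can be pasted into spark-shell (using :paste). The records, field names, and application name are made up for illustration and are not part of the Amazon dataset. The sketch assumes Spark 2.2 or later, where spark.read.json accepts a Dataset[String].

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists and getOrCreate() simply returns it;
// the builder settings only matter when no session has been created yet.
val spark = SparkSession.builder()
  .appName("json-nested-sketch") // hypothetical application name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical records, one JSON object per string.
val records = Seq(
  """{"id": 1, "customer": {"name": "Ann", "address": {"city": "Pune"}}}""",
  """{"id": 2, "customer": {"name": "Bob", "address": {"city": "Oslo"}}}"""
).toDS()

// No schema is supplied; Spark infers it by scanning the records.
val nestedDF = spark.read.json(records)

// Nested fields are addressed with dot notation, however deep they sit.
val flatDF = nestedDF.select($"id", $"customer.name", $"customer.address.city")
flatDF.show()
```

The selected columns take the name of the innermost field (name and city here), so the result is a flat DataFrame that can be filtered or joined like any other.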
The dataset for this example contains approximately 1.69 million Amazon reviews for the electronics category, and can be downloaded from: http://jmcauley.ucsd.edu/data/amazon/.
We can directly read a JSON dataset to create a Spark SQL DataFrame. We will read in a sample set of review records from a JSON file:
scala> val reviewsDF = spark.read.json("file:///Users/aurobindosarkar/Downloads/reviews_Electronics_5.json")
You can print the schema of the newly created DataFrame to verify the fields and their characteristics...
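As a rough illustration of checking an inferred schema, the following sketch builds a tiny DataFrame from two made-up records and inspects it in two ways. The field names are borrowed from the Amazon review files, but the values, the variable names, and the application name are assumptions for this example only. It assumes Spark 2.2 or later and is meant to be pasted into spark-shell (using :paste).

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists and getOrCreate() simply returns it.
val spark = SparkSession.builder()
  .appName("schema-check-sketch") // hypothetical application name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Two made-up records that reuse a few field names from the review files.
val sample = Seq(
  """{"reviewerID": "R1", "asin": "A1", "overall": 5.0, "helpful": [2, 3], "unixReviewTime": 1400000000}""",
  """{"reviewerID": "R2", "asin": "A2", "overall": 3.0, "helpful": [0, 0], "unixReviewTime": 1400086400}"""
).toDS()

val sampleDF = spark.read.json(sample)

// Tree view of column names, inferred types, and nullability.
sampleDF.printSchema()

// The same information as (name, type) pairs, convenient for programmatic checks.
val inferredTypes = sampleDF.dtypes.toMap
inferredTypes.foreach { case (name, dataType) => println(s"$name: $dataType") }
```

Inferred types follow the JSON values: quoted values become strings, whole numbers become longs, and numbers with a decimal point become doubles. A field such as overall therefore needs an explicit cast if you want to treat it as an integer rating.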