Loading JSON into DataFrames
JSON is one of the most widely used text-based data representation formats. In this recipe, we'll see how to load JSON data into a DataFrame. To make it more interesting, we'll read the JSON from HDFS instead of the local filesystem.
The Hadoop Distributed File System (HDFS) is a highly distributed filesystem that is both scalable and fault tolerant. It is a critical part of the Hadoop ecosystem and is inspired by the Google File System paper (http://research.google.com/archive/gfs.html). More details about the architecture and communication protocols on HDFS can be found at http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
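As a rough preview of what reading JSON from HDFS looks like, here is a minimal sketch that assumes Spark 1.3-style APIs. The HDFS host, port, and file path are hypothetical placeholders, and so is the object name. The file is assumed to hold one JSON object per line.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonFromHdfsSketch {
  def main(args: Array[String]): Unit = {
    // Local master for experimentation only; a cluster deployment would configure this elsewhere.
    val conf = new SparkConf().setAppName("JsonFromHdfsSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical HDFS location; replace host, port, and path with your own.
    val path = "hdfs://localhost:9000/data/profiles.json"

    // jsonFile reads one JSON object per line and infers the schema from the data.
    // It is deprecated from Spark 1.4 onward in favour of sqlContext.read.json(path).
    val df = sqlContext.jsonFile(path)

    df.printSchema() // print the inferred column names and types
    df.show()        // print the first rows of the DataFrame

    sc.stop()
  }
}
```

The only HDFS-specific part is the hdfs:// URI; pointing the same call at a local path reads from the local filesystem instead.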
How to do it…
In this recipe, we'll see three subrecipes:
- How to create a DataFrame with an inferred schema from JSON using sqlContext.jsonFile
- Alternatively, if we prefer to preprocess the input file before parsing it into JSON, we'll read the input file as text and convert it into JSON using sqlContext...