Reading and writing data from and to JSON, including nested JSON
Spark SQL automatically detects the schema of a JSON dataset from the files and loads it as a DataFrame. It also provides options for querying JSON data while reading and writing it. Nested JSON can be parsed as well, and its fields can be accessed directly without any explicit transformations.
Getting ready
You can follow along by running the steps in the 2_8.Reading and Writing data from and to Json including nested json.ipynb notebook in the Chapter02 folder of your locally cloned repository.
Upload the JsonData folder from the Chapter02/sensordata folder to an ADLS Gen-2 account that has sensordata as its file system. We are mounting the ADLS Gen-2 storage account with the sensordata file system to /mnt/SensorData.
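If the storage account is not already mounted, a minimal sketch of the mount command is shown below. It assumes service principal (OAuth) authentication; the application ID, tenant ID, secret scope, secret key, and storage account name are placeholders you must replace with your own values.
# Placeholder OAuth configuration for an ADLS Gen-2 service principal
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<secret-scope>", key="<secret-key>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}
# Mount the sensordata file system of the storage account to /mnt/SensorData
dbutils.fs.mount(
  source = "abfss://sensordata@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/SensorData",
  extra_configs = configs)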
The JsonData folder has two subfolders: SimpleJsonData, which has files with a simple JSON structure, and JsonData, which has files with a nested JSON structure.
Note
The code was tested on Databricks Runtime Version 7.3 LTS with Spark 3.0.1.
In the upcoming section, we will learn how to process simple and complex JSON data files. We will use sensordata files with simple and nested schemas.
How to do it…
In this section, you will see how you can read and write simple and nested JSON data files:
- You can read JSON data files using the following code snippet. You need to set the multiline option to true when reading a JSON file that spans multiple lines; for a single-line JSON data file, this option can be skipped.
df_json = spark.read.option("multiline","true").json("/mnt/SensorData/JsonData/SimpleJsonData/")
display(df_json)
- After executing the preceding code, you can see the schema of the JSON data.
- Writing a JSON file is identical to writing a CSV file. You can use the following code snippet to write a JSON file using Azure Databricks. You can specify different mode options while writing JSON data, such as append, overwrite, ignore, and error or errorifexists. The error mode is the default, and it throws an exception if data already exists at the target path.
df_json.write.format("json").mode("overwrite").save("/mnt/SensorData/JsonData/SimpleJsonData/")
Very often, you will come across scenarios where you need to handle complex data types such as arrays, maps, and structs while processing data. To encode a struct type as a string and to read it back as a complex type, Spark provides the to_json() and from_json() functions. If you are receiving data from a streaming source in nested JSON format, you can use the from_json() function to process the input data before sending it downstream. Similarly, you can use the to_json() method to encode or convert columns in a DataFrame to a JSON string and send the dataset to various destinations, such as Event Hubs, Data Lake storage, an Azure Cosmos DB database, or RDBMS systems like SQL Server and Oracle. You can process simple and nested JSON by following the steps below.
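Before walking through those steps, here is a hedged illustration of from_json(): it parses a JSON string column (for example, the body of a streaming message) into a queryable struct using an explicitly defined schema. The schema, column names, and sample record are assumptions for illustration only, not the recipe's dataset.
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Illustrative schema: an _id field plus an Owners array of {name, phone} structs
sensor_schema = StructType([
    StructField("_id", StringType()),
    StructField("Owners", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("phone", StringType())
    ])))
])

# A DataFrame with a JSON string column, standing in for incoming streaming data
raw_df = spark.createDataFrame(
    [('{"_id": "1", "Owners": [{"name": "Ana", "phone": "555-0100"}]}',)],
    ["value"])

# from_json() converts the string column into a struct you can query directly
parsed_df = raw_df.select(from_json(col("value"), sensor_schema).alias("sensor")) \
                  .select("sensor._id", "sensor.Owners")
display(parsed_df)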
- Execute the following code to read and display the nested dataset from the mount location of the storage account.
df_json = spark.read.option("multiline","true").json("dbfs:/mnt/SensorData/JsonData/")
display(df_json)
- You can see that the vehicle sensor data has an Owners attribute in multiline JSON format.
- To convert the Owners attribute into row format for joining or transforming the data, you can use the explode() function. Execute a command like the sketch shown below to perform this operation. You can see that, using the explode() function, the elements of the array are converted into two columns named name and phone.
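The exact command depends on your file's schema; the following is a minimal sketch that assumes each nested record carries an _id field and an Owners array of {name, phone} structs.
from pyspark.sql.functions import explode, col

# Explode the Owners array into one row per element, then flatten name and phone
data_df = (df_json
           .select(col("_id"), explode(col("Owners")).alias("Owner"))
           .select(col("_id"), col("Owner.name").alias("name"), col("Owner.phone").alias("phone")))
display(data_df)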
- To encode or convert the columns in the DataFrame to a JSON string, the to_json() method can be used. In this example, we will create a new DataFrame with a json column from the existing DataFrame data_df created in the preceding step. Execute the following code to create the new DataFrame with the json column.
from pyspark.sql.functions import to_json, struct
jsonDF = data_df.withColumn("jsonCol", to_json(struct([data_df[x] for x in data_df.columns]))).select("jsonCol")
display(jsonDF)
- After executing the preceding command, you will get a new DataFrame with a column named jsonCol.
- Using the to_json() method, you have converted the _id, name, and phone columns into a new json column.
How it works…
Spark provides you with options to process data from multiple sources and in multiple formats. You can process the data and enrich it before sending it to downstream applications. Data can be sent to a downstream application with low latency on streaming data or high throughput on historical data.