Reading and writing data from and to JSON, including nested JSON
Spark SQL automatically detects the schema of a JSON dataset from the files and loads it as a DataFrame. It also provides options for querying JSON data while reading and writing it. Nested JSON can be parsed as well, and its fields can be accessed directly without any explicit transformations.
Getting ready
You can follow along by running the steps in the 2_8.Reading and Writing data from and to Json including nested json.ipynb notebook in the Chapter02 folder of your locally cloned repository.
Upload the JsonData folder from the Chapter02/sensordata folder to an ADLS Gen-2 account that has sensordata as its file system. We are mounting the ADLS Gen-2 storage account with the sensordata file system to /mnt/SensorData.
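If the storage account is not already mounted, a minimal sketch of the mount command is shown below. It assumes service principal (OAuth) authentication; the application ID, tenant ID, secret scope, secret key, and storage account name are placeholders you must replace with your own values.
# Placeholder OAuth configuration for an ADLS Gen-2 service principal
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<secret-scope>", key="<secret-key>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}
# Mount the sensordata file system of the storage account to /mnt/SensorData
dbutils.fs.mount(
  source = "abfss://sensordata@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/SensorData",
  extra_configs = configs)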
The JsonData folder has two subfolders: SimpleJsonData, which has files with a simple JSON structure, and JsonData, which has files with a nested JSON structure.
Note
The code was tested on Databricks Runtime Version 7.3 LTS with Spark 3.0.1.
In the upcoming section, we will learn how to process simple and complex JSON data files. We will use sensordata files with simple and nested schemas.
How to do it…
In this section, you will see how you can read and write simple and nested JSON data files:
- You can read JSON data files using the following code snippet. You need to set the multiline option to true when reading a JSON file that spans multiple lines; for a single-line JSON data file, this option can be skipped.
df_json = spark.read.option("multiline","true").json("/mnt/SensorData/JsonData/SimpleJsonData/")
display(df_json)
- After executing the preceding code, you can see the schema of the JSON data.
- Writing a JSON file is identical to writing a CSV file. You can use the following code snippet to write a JSON file using Azure Databricks. You can specify different mode options while writing JSON data, such as append, overwrite, ignore, and error or errorifexists. The error mode is the default, and it throws an exception if data already exists at the target path.
df_json.write.format("json").mode("overwrite").save("/mnt/SensorData/JsonData/SimpleJsonData/")
Very often, you will come across scenarios where you need to handle complex data types such as arrays, maps, and structs while processing data. To encode a struct type as a string and to read it back as a complex type, Spark provides the to_json() and from_json() functions. If you are receiving data from a streaming source in nested JSON format, you can use the from_json() function to process the input data before sending it downstream. Similarly, you can use the to_json() method to encode or convert columns in a DataFrame to a JSON string and send the dataset to various destinations, such as Event Hubs, Data Lake storage, an Azure Cosmos DB database, or RDBMS systems like SQL Server and Oracle. You can process simple and nested JSON by following the steps below.
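Before walking through those steps, here is a hedged illustration of from_json(): it parses a JSON string column (for example, the body of a streaming message) into a queryable struct using an explicitly defined schema. The schema, column names, and sample record are assumptions for illustration only, not the recipe's dataset.
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Illustrative schema: an _id field plus an Owners array of {name, phone} structs
sensor_schema = StructType([
    StructField("_id", StringType()),
    StructField("Owners", ArrayType(StructType([
        StructField("name", StringType()),
        StructField("phone", StringType())
    ])))
])

# A DataFrame with a JSON string column, standing in for incoming streaming data
raw_df = spark.createDataFrame(
    [('{"_id": "1", "Owners": [{"name": "Ana", "phone": "555-0100"}]}',)],
    ["value"])

# from_json() converts the string column into a struct you can query directly
parsed_df = raw_df.select(from_json(col("value"), sensor_schema).alias("sensor")) \
                  .select("sensor._id", "sensor.Owners")
display(parsed_df)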
- Execute the following code to read and display the nested dataset from the mount location of the storage account.
df_json = spark.read.option("multiline","true").json("dbfs:/mnt/SensorData/JsonData/")
display(df_json)
- You can see that the vehicle sensor data has an Owners attribute in multiline JSON format.
- To convert the Owners attribute into row format for joining or transforming the data, you can use the explode() function. Execute a command like the sketch shown below to perform this operation. You can see that, using the explode() function, the elements of the array are converted into two columns named name and phone.
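The exact command depends on your file's schema; the following is a minimal sketch that assumes each nested record carries an _id field and an Owners array of {name, phone} structs.
from pyspark.sql.functions import explode, col

# Explode the Owners array into one row per element, then flatten name and phone
data_df = (df_json
           .select(col("_id"), explode(col("Owners")).alias("Owner"))
           .select(col("_id"), col("Owner.name").alias("name"), col("Owner.phone").alias("phone")))
display(data_df)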
- To encode or convert the columns in the DataFrame to a JSON string, the to_json() method can be used. In this example, we will create a new DataFrame with a json column from the existing DataFrame data_df created in the preceding step. Execute the following code to create the new DataFrame with the json column.
from pyspark.sql.functions import to_json, struct
jsonDF = data_df.withColumn("jsonCol", to_json(struct([data_df[x] for x in data_df.columns]))).select("jsonCol")
display(jsonDF)
- After executing the preceding command, you will get a new DataFrame with a column named jsonCol.
- Using the to_json() method, you have converted the _id, name, and phone columns into a new json column.
How it works…
Spark provides you with options to process data from multiple sources and in multiple formats. You can process the data and enrich it before sending it to downstream applications. Data can be sent to a downstream application with low latency on streaming data or high throughput on historical data.