Reading and writing data from and to CSV and Parquet
Azure Databricks supports multiple file formats, including sequence files, Record Columnar files, and Optimized Row Columnar files. It also provides native support for CSV, JSON, and Parquet file formats.
Parquet is the most widely used file format in the Databricks Cloud for the following reasons:
- Columnar storage format—Stores data column-wise, unlike row-based format files such as Avro and CSV.
- Open source—Parquet is open source and free to use.
- Aggressive compression—Parquet supports compression, which many other file formats lack. Because of its compression techniques, it requires less storage than other file formats. It uses different encoding methods for compression.
- Performance—The Parquet file format is designed for fast query performance. Because it stores both data and metadata, you can retrieve the relevant data quickly, and the amount of data scanned is comparatively smaller, resulting in less input/output (I/O).
- Schema evolution—It supports changes to the column schema as required. Multiple Parquet files with compatible schemas can be merged.
- Self-describing—Each Parquet file contains metadata and data, which makes it self-describing.
Parquet files also support predicate push-down, column filtering, and both static and dynamic partition pruning.
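To see what these optimizations look like in practice, here is a minimal sketch of a filtered, column-projected Parquet read. The path is the one created later in this recipe, and the column names C_CUSTKEY and C_MKTSEGMENT are assumptions about the customer dataset, so treat this as an illustration rather than part of the recipe:
# Hypothetical sketch: select only the needed columns and filter early so Spark
# can prune columns and push the predicate down to the Parquet scan.
parquetDF = (spark.read
             .parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/")
             .select("C_CUSTKEY", "C_MKTSEGMENT")    # column filtering/pruning
             .filter("C_MKTSEGMENT = 'BUILDING'"))   # predicate push-down
parquetDF.explain()  # the physical plan shows PushedFilters on the Parquet scan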
In this recipe, you will learn how to read from and write to CSV and Parquet files using Azure Databricks.
Getting ready
You can follow along by running the 2_7.Reading and Writing data from and to CSV, Parquet.ipynb notebook in the Chapter02 folder of your locally cloned repository.
Upload the csvFiles folder from the Chapter02/Customer folder to the Customer/csvFiles folder in the rawdata file system of the ADLS Gen2 storage account.
How to do it…
Here are the steps and code samples for reading from and writing to CSV and Parquet files using Azure Databricks. The CSV and Parquet file formats are covered in separate sections.
Working with the CSV file format
Go through the following steps for reading CSV files and saving data in CSV format.
- Ensure that you have mounted the ADLS Gen2 Storage location. If not, you can refer to the Mounting ADLS Gen2 and Azure Blob storage to Azure DBFS recipe in this chapter to follow the steps for mounting a storage account.
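If you are not sure whether the location is already mounted, a minimal check (using the /mnt/Gen2Source mount point assumed throughout this recipe) is the following:
# List the existing mount points and confirm that /mnt/Gen2Source is among them
mount_points = [m.mountPoint for m in dbutils.fs.mounts()]
print("/mnt/Gen2Source" in mount_points)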
- Run the following code to list the CSV data files from the mounted ADLS Gen2 storage account:
#Listing CSV Files
dbutils.fs.ls("/mnt/Gen2Source/Customer/csvFiles")
- Read the customer data stored in the csv files in the ADLS Gen2 storage account by running the following code:
customerDF = spark.read.format("csv").option("header", True).option("inferSchema", True).load("/mnt/Gen2Source/Customer/csvFiles")
- You can display the result of a DataFrame by running the following code:
customerDF.show()
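Because the schema was inferred with the inferSchema option, you may also want to confirm the column names and data types Spark derived; a short sketch:
# Inspect the inferred schema and the number of records read from the CSV files
customerDF.printSchema()
print(customerDF.count())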
- By running the following code, we are writing the customerDF DataFrame data to the location /mnt/Gen2Source/Customer/WriteCsvFiles in CSV format:
customerDF.write.mode("overwrite").option("header", "true").csv("/mnt/Gen2Source/Customer/WriteCsvFiles")
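Note that Spark writes one CSV part file per partition of the DataFrame, so the target folder will contain multiple files. If you need a single output file (suitable only for small datasets), one possible variation is to reduce the DataFrame to a single partition before writing; the WriteCsvFilesSingle path below is hypothetical and not part of the recipe:
# Hypothetical variation: coalesce to one partition so a single CSV part file is written
(customerDF.coalesce(1)
    .write.mode("overwrite")
    .option("header", "true")
    .csv("/mnt/Gen2Source/Customer/WriteCsvFilesSingle"))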
- To confirm that the data is written to the target folder in csv format, let's read the csv files from the target folder by running the following code:
targetDF = spark.read.format("csv").option("header", True).option("inferSchema", True).load("/mnt/Gen2Source/Customer/WriteCsvFiles")
targetDF.show()
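A simple way to validate the round trip is to compare the record counts of the source and target DataFrames; a minimal sketch:
# The counts should match if all rows were written and read back correctly
print(customerDF.count() == targetDF.count())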
In the following section, we will learn how to read data from and write data to Parquet files.
Working with the Parquet file format
Let's get started.
- You can use the same customer dataset for reading from the CSV files and writing into the Parquet file format.
- We will use the targetDF DataFrame from Step 6 and save it in parquet format by running the following code. We are using the overwrite save mode, so any existing data in the target or destination folder is overwritten:
#Writing the targetDF data, which holds the CSV data we read, as Parquet files using overwrite mode
targetDF.write.mode("overwrite").option("header", "true").parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/")
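If you expect queries to filter on a particular column, you can also partition the Parquet output so that partition pruning (mentioned earlier) can skip entire folders. This is an optional sketch, not part of the recipe; both the partition column C_MKTSEGMENT and the output path are assumptions about the customer dataset:
# Hypothetical partitioned write: one sub-folder per distinct C_MKTSEGMENT value
(targetDF.write.mode("overwrite")
    .partitionBy("C_MKTSEGMENT")
    .parquet("/mnt/Gen2Source/Customer/csvasParquetFilesPartitioned/"))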
- In the following code, we read the data from the csvasParquetFiles folder to confirm that it is in Parquet format:
df_parquetfiles = spark.read.format("parquet").option("header", True).load("/mnt/Gen2Source/Customer/csvasParquetFiles/")
display(df_parquetfiles.limit(5))
- Let's change the save mode from overwrite to append by running the following code. With the append save mode, new data is inserted and existing data is preserved in the target or destination folder:
#Using append as the save mode
targetDF.write.mode("append").option("header", "true").parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/")
- Run the following code to check the count of records in the Parquet folder; the number should have increased because we appended data to the same folder:
df_parquetfiles = spark.read.format("parquet").option("header", True).load("/mnt/Gen2Source/Customer/csvasParquetFiles/")
df_parquetfiles.count()
By the end of this recipe, you will have learned how to read from and write to CSV and Parquet files.
How it works…
The CSV file format is widely supported by many tools and is often the default format for exchanging data. However, compared with the Parquet file format, it has clear disadvantages in terms of cost, query processing time, and the size of the data files. CSV also doesn't support partition pruning, which directly increases the cost of storing and processing data in this format.
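If you want to see the size difference for yourself, here is a small sketch that sums the file sizes in the two output folders created in this recipe:
# Compare the on-disk footprint of the CSV output and the Parquet output
csv_bytes = sum(f.size for f in dbutils.fs.ls("/mnt/Gen2Source/Customer/WriteCsvFiles"))
parquet_bytes = sum(f.size for f in dbutils.fs.ls("/mnt/Gen2Source/Customer/csvasParquetFiles/"))
print(csv_bytes, parquet_bytes)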
Conversely, Parquet is a columnar format that supports compression and partition pruning. It is widely used in big data projects for both reading and writing data. A Parquet file stores data and metadata together, which makes it self-describing.
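Because the schema travels with the file, you can read it back without supplying any schema or header options; a minimal sketch:
# The schema comes from the Parquet metadata, not from inference over the data
spark.read.parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/").printSchema()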
Parquet also supports schema evolution, which means you can change the schema of the data as required. This helps in developing systems that can accommodate changes in the schema as it matures. In such cases, you may end up with multiple Parquet files that have different schemas but are compatible.
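When you do end up with Parquet files written with compatible but different schemas, Spark can reconcile them at read time with the mergeSchema option. A hedged sketch follows; the folder is the one used in this recipe, which in practice contains only a single schema:
# Merge the schemas of all Parquet files under the folder into one unified schema
mergedDF = (spark.read
            .option("mergeSchema", "true")
            .parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/"))
mergedDF.printSchema()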