Reading and writing data from and to CSV and Parquet
Azure Databricks supports multiple file formats, including sequence files, Record Columnar files, and Optimized Row Columnar files. It also provides native support for CSV, JSON, and Parquet file formats.
Parquet is the most widely used file format in the Databricks Cloud for the following reasons:
- Columnar storage format—Stores data column-wise, unlike row-based format files such as Avro and CSV.
- Open source—Parquet is open source and free to use.
- Aggressive compression—Parquet supports compression, which many other file formats lack. Because of its compression techniques, it requires less storage than other file formats. It uses different encoding methods for compression.
- Performance—The Parquet file format is designed for fast query performance. Because it stores both data and metadata, you can retrieve the relevant data quickly, and the amount of data scanned is comparatively smaller, resulting in less input/output (I/O).
- Schema evolution—It supports changes to the column schema as required. Multiple Parquet files with compatible schemas can be merged.
- Self-describing—Each Parquet file contains metadata and data, which makes it self-describing.
Parquet files also support predicate push-down, column filtering, and both static and dynamic partition pruning.
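To see what these optimizations look like in practice, here is a minimal sketch of a filtered, column-projected Parquet read. The path is the one created later in this recipe, and the column names C_CUSTKEY and C_MKTSEGMENT are assumptions about the customer dataset, so treat this as an illustration rather than part of the recipe:
# Hypothetical sketch: select only the needed columns and filter early so Spark
# can prune columns and push the predicate down to the Parquet scan.
parquetDF = (spark.read
             .parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/")
             .select("C_CUSTKEY", "C_MKTSEGMENT")    # column filtering/pruning
             .filter("C_MKTSEGMENT = 'BUILDING'"))   # predicate push-down
parquetDF.explain()  # the physical plan shows PushedFilters on the Parquet scan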
In this recipe, you will learn how to read from and write to CSV and Parquet files using Azure Databricks.
Getting ready
You can follow along by running the 2_7.Reading and Writing data from and to CSV, Parquet.ipynb notebook in the Chapter02 folder of your locally cloned repository.
Upload the csvFiles folder from the Chapter02/Customer folder to the Customer/csvFiles folder in the rawdata file system of the ADLS Gen2 storage account.
How to do it…
Here are the steps and code samples for reading from and writing to CSV and Parquet files using Azure Databricks. The CSV and Parquet file formats are covered in separate sections.
Working with the CSV file format
Go through the following steps for reading CSV files and saving data in CSV format.
- Ensure that you have mounted the ADLS Gen2 Storage location. If not, you can refer to the Mounting ADLS Gen2 and Azure Blob storage to Azure DBFS recipe in this chapter to follow the steps for mounting a storage account.
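If you are not sure whether the location is already mounted, a minimal check (using the /mnt/Gen2Source mount point assumed throughout this recipe) is the following:
# List the existing mount points and confirm that /mnt/Gen2Source is among them
mount_points = [m.mountPoint for m in dbutils.fs.mounts()]
print("/mnt/Gen2Source" in mount_points)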
- Run the following code to list the CSV data files from the mounted ADLS Gen2 storage account:
#Listing CSV Files
dbutils.fs.ls("/mnt/Gen2Source/Customer/csvFiles")
- Read the customer data stored in the csv files in the ADLS Gen2 storage account by running the following code:
customerDF = spark.read.format("csv").option("header", True).option("inferSchema", True).load("/mnt/Gen2Source/Customer/csvFiles")
- You can display the result of a DataFrame by running the following code:
customerDF.show()
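Because the schema was inferred with the inferSchema option, you may also want to confirm the column names and data types Spark derived; a short sketch:
# Inspect the inferred schema and the number of records read from the CSV files
customerDF.printSchema()
print(customerDF.count())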
- By running the following code, we are writing the customerDF DataFrame data to the location /mnt/Gen2Source/Customer/WriteCsvFiles in CSV format:
customerDF.write.mode("overwrite").option("header", "true").csv("/mnt/Gen2Source/Customer/WriteCsvFiles")
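Note that Spark writes one CSV part file per partition of the DataFrame, so the target folder will contain multiple files. If you need a single output file (suitable only for small datasets), one possible variation is to reduce the DataFrame to a single partition before writing; the WriteCsvFilesSingle path below is hypothetical and not part of the recipe:
# Hypothetical variation: coalesce to one partition so a single CSV part file is written
(customerDF.coalesce(1)
    .write.mode("overwrite")
    .option("header", "true")
    .csv("/mnt/Gen2Source/Customer/WriteCsvFilesSingle"))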
- To confirm that the data is written to the target folder in csv format, let's read the csv files from the target folder by running the following code:
targetDF = spark.read.format("csv").option("header", True).option("inferSchema", True).load("/mnt/Gen2Source/Customer/WriteCsvFiles")
targetDF.show()
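A simple way to validate the round trip is to compare the record counts of the source and target DataFrames; a minimal sketch:
# The counts should match if all rows were written and read back correctly
print(customerDF.count() == targetDF.count())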
In the following section, we will learn how to read data from and write data to Parquet files.
Working with the Parquet file format
Let's get started.
- You can use the same customer dataset for reading from the CSV files and writing into the Parquet file format.
- We will use the targetDF DataFrame from Step 6 and save it in parquet format by running the following code. We are using the overwrite save mode, so any existing data in the target or destination folder is overwritten:
#Writing the targetDF data, which holds the CSV data we read, as Parquet files using overwrite mode
targetDF.write.mode("overwrite").option("header", "true").parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/")
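If you expect queries to filter on a particular column, you can also partition the Parquet output so that partition pruning (mentioned earlier) can skip entire folders. This is an optional sketch, not part of the recipe; both the partition column C_MKTSEGMENT and the output path are assumptions about the customer dataset:
# Hypothetical partitioned write: one sub-folder per distinct C_MKTSEGMENT value
(targetDF.write.mode("overwrite")
    .partitionBy("C_MKTSEGMENT")
    .parquet("/mnt/Gen2Source/Customer/csvasParquetFilesPartitioned/"))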
- In the following code, we read the data from the csvasParquetFiles folder to confirm that it is in Parquet format:
df_parquetfiles = spark.read.format("parquet").option("header", True).load("/mnt/Gen2Source/Customer/csvasParquetFiles/")
display(df_parquetfiles.limit(5))
- Let's change the save mode from overwrite to append by running the following code. With the append save mode, new data is inserted and existing data is preserved in the target or destination folder:
#Using append as the save mode
targetDF.write.mode("append").option("header", "true").parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/")
- Run the following code to check the count of records in the Parquet folder; the number should have increased because we appended data to the same folder:
df_parquetfiles = spark.read.format("parquet").option("header", True).load("/mnt/Gen2Source/Customer/csvasParquetFiles/")
df_parquetfiles.count()
By the end of this recipe, you will have learned how to read from and write to CSV and Parquet files.
How it works…
The CSV file format is widely supported by many tools and is often the default format for exchanging data. However, compared with the Parquet file format, it has clear disadvantages in terms of cost, query processing time, and the size of the data files. CSV also doesn't support partition pruning, which directly increases the cost of storing and processing data in this format.
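If you want to see the size difference for yourself, here is a small sketch that sums the file sizes in the two output folders created in this recipe:
# Compare the on-disk footprint of the CSV output and the Parquet output
csv_bytes = sum(f.size for f in dbutils.fs.ls("/mnt/Gen2Source/Customer/WriteCsvFiles"))
parquet_bytes = sum(f.size for f in dbutils.fs.ls("/mnt/Gen2Source/Customer/csvasParquetFiles/"))
print(csv_bytes, parquet_bytes)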
Conversely, Parquet is a columnar format that supports compression and partition pruning. It is widely used in big data projects for both reading and writing data. A Parquet file stores data and metadata together, which makes it self-describing.
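Because the schema travels with the file, you can read it back without supplying any schema or header options; a minimal sketch:
# The schema comes from the Parquet metadata, not from inference over the data
spark.read.parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/").printSchema()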
Parquet also supports schema evolution, which means you can change the schema of the data as required. This helps in developing systems that can accommodate changes in the schema as it matures. In such cases, you may end up with multiple Parquet files that have different schemas but are compatible.
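When you do end up with Parquet files written with compatible but different schemas, Spark can reconcile them at read time with the mergeSchema option. A hedged sketch follows; the folder is the one used in this recipe, which in practice contains only a single schema:
# Merge the schemas of all Parquet files under the folder into one unified schema
mergedDF = (spark.read
            .option("mergeSchema", "true")
            .parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/"))
mergedDF.printSchema()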