Processing data using Spark pools and a lake database
Spark pools in a Synapse workspace allow us to process data and store it as tables inside a lake database. A lake database lets us create tables backed by CSV files, Parquet files, or Delta tables stored in the data lake account. Delta tables use Parquet files for storage and support insert, update, delete, and merge operations. Because Parquet is a compressed, columnar format, Delta tables are well suited to storing processed data and serving analytic workloads. In this recipe, we will read a CSV file, perform basic processing, and load the data into a Delta table in a lake database.
Getting ready
Create a Synapse Analytics workspace, as explained in the Provisioning an Azure Synapse Analytics workspace recipe.
Create a Spark pool cluster, as explained in the Provisioning and configuring Spark pools recipe.
We need to upload the covid-data.csv file from https://github.com/PacktPublishing/Azure-Data-Engineering...