Batch ETL process demo
Databricks professionals often talk about a medallion architecture. In this architecture, the data processing in a pipeline is divided into three layers: bronze, silver, and gold. The bronze layer holds the raw data, the silver layer holds the cleansed data, and the gold layer consists of aggregated or modeled data.
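To make the idea concrete, here is a rough sketch of how the three layers might be laid out in PySpark. It assumes a Databricks notebook where `spark` is predefined and Delta Lake is available; the paths and column names are illustrative assumptions, not part of the pipeline we build in this section.

```python
from pyspark.sql import functions as F

# Bronze: land the raw source data unchanged (hypothetical input path).
raw_df = spark.read.json("dbfs:/tmp/demo/raw_events/")
raw_df.write.format("delta").mode("overwrite").save("dbfs:/tmp/demo/bronze/events")

# Silver: cleanse and standardize the bronze data.
bronze_df = spark.read.format("delta").load("dbfs:/tmp/demo/bronze/events")
silver_df = bronze_df.dropDuplicates().filter(F.col("event_ts").isNotNull())
silver_df.write.format("delta").mode("overwrite").save("dbfs:/tmp/demo/silver/events")

# Gold: aggregate the cleansed data for reporting.
gold_df = silver_df.groupBy("event_type").count()
gold_df.write.format("delta").mode("overwrite").save("dbfs:/tmp/demo/gold/event_counts")
```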
Check out https://databricks.com/solutions/data-pipelines for more information. In this section, we will walk through a real-world batch ETL process. We will perform the following steps:
- Read the data and create a Spark DataFrame.
- Perform transformations to clean the data and implement business logic.
- Write the DataFrame to Delta Lake.
- Create a Delta table from the written data and perform exploratory data analysis.
The dataset that we will be working with is part of databricks-datasets and is located in the following directory:
dbfs:/databricks-datasets/samples/lending_club/parquet/
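Before we start building the notebook, the following is a minimal preview of these four steps in PySpark, again assuming a Databricks notebook where `spark` is predefined; the output path, table name, and the specific columns referenced are assumptions for illustration rather than the exact code we develop below.

```python
from pyspark.sql import functions as F

# 1. Read the raw Parquet files into a Spark DataFrame.
df = spark.read.parquet("dbfs:/databricks-datasets/samples/lending_club/parquet/")

# 2. Transform: keep a few columns, drop duplicates, and remove rows
#    without a loan amount (column names are assumptions about the schema).
cleaned_df = (
    df.select("loan_amnt", "grade", "loan_status", "addr_state")
      .dropDuplicates()
      .filter(F.col("loan_amnt").isNotNull())
)

# 3. Write the DataFrame to Delta Lake (hypothetical output path).
delta_path = "dbfs:/tmp/lending_club_delta"
cleaned_df.write.format("delta").mode("overwrite").save(delta_path)

# 4. Create a Delta table over the written data and explore it with SQL.
spark.sql(f"CREATE TABLE IF NOT EXISTS lending_club USING DELTA LOCATION '{delta_path}'")
spark.sql("""
    SELECT grade, COUNT(*) AS loans, ROUND(AVG(loan_amnt), 2) AS avg_loan_amnt
    FROM lending_club
    GROUP BY grade
    ORDER BY grade
""").show()
```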
So, create...