Orchestrating jobs with Azure Databricks
Until now, we have been able to take data stored in either an S3 bucket or Azure Blob storage, transform it using PySpark or SQL, and then persist the transformed data into a table. The question now is: how do we integrate these steps into a complete ETL pipeline? One of the options available to us is Azure Data Factory (ADF), which can run our Azure Databricks notebook as one of the steps in our data architecture.
In the next example, we will use ADF to trigger our notebook, passing it the name of the file that contains the data we want to process and using that file to update our voting turnout table (a minimal notebook sketch follows the list). For this, you will require the following:
- An Azure subscription
- An Azure Databricks notebook attached to a running cluster
- The Voting_Turnout_US_2020 dataset loaded into a Spark DataFrame
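The following sketch shows what the notebook side of this hand-off could look like. It assumes the file name is passed in as a notebook widget parameter called file_name, that the file sits under a mount point such as /mnt/data/, and that the target table is named voting_turnout; these names are illustrative placeholders, not fixed by the example.

```python
# Runs inside an Azure Databricks notebook, where `spark` and `dbutils`
# are provided by the runtime. file_name, /mnt/data/, and voting_turnout
# are placeholder names for this sketch.

# Declare a widget so ADF can pass the file name as a base parameter.
dbutils.widgets.text("file_name", "")
file_name = dbutils.widgets.get("file_name")

if not file_name:
    raise ValueError("No file_name parameter was passed to the notebook")

# Read the incoming file (assumed to be a CSV with a header) into a DataFrame.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(f"/mnt/data/{file_name}")
)

# Persist the new data to the voting turnout table.
df.write.mode("overwrite").saveAsTable("voting_turnout")
```

On the ADF side, the same parameter name is supplied under the Databricks Notebook activity's base parameters, which is how the value provided by the pipeline reaches dbutils.widgets.get in the notebook.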
ADF
ADF is Azure's cloud service for serverless data integration, used to build and orchestrate data transformation and aggregation processes. It can integrate...
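Although pipelines are usually assembled in the ADF authoring UI, a run can also be triggered programmatically. The following is a minimal sketch, assuming the azure-identity and azure-mgmt-datafactory packages and a pipeline that exposes a file_name parameter; the subscription, resource group, factory, and pipeline names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder values -- replace with the names from your own subscription.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
pipeline_name = "<pipeline-with-databricks-notebook-activity>"

# Authenticate with whatever credential is available in the environment.
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Trigger a pipeline run, passing the file to process as a pipeline parameter.
run = adf_client.pipelines.create_run(
    resource_group,
    factory_name,
    pipeline_name,
    parameters={"file_name": "Voting_Turnout_US_2020.csv"},
)
print(f"Started pipeline run {run.run_id}")
```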