Deploying Delta Live Tables pipelines with Databricks Asset Bundles
Databricks Asset Bundles is a new feature that lets you package your data, analytics, and machine learning projects as a collection of source files and deploy them to Databricks workspaces. You use a YAML file to specify the resources and settings for your project, such as jobs, pipelines, endpoints, experiments, and models. This way, you can manage your code and infrastructure in a consistent and automated way.
The following are some of the benefits of using Databricks Asset Bundles:
- You can use best practice tools and processes to work with source code, such as source control, code review, testing, and CI/CD
- You can streamline your local development with IDEs and run your resources before deploying them to production
- You can configure your deployments across multiple workspaces, regions, and clouds
In this recipe, you will learn how to package your pipeline as a Databricks Asset Bundle and deploy it to a production workspace.
Getting ready
Before you start, you will need the following:
- A Databricks account with access to a workspace and a cluster.
- The Databricks CLI installed on your local machine. Please make sure you follow the instructions here: https://docs.databricks.com/en/dev-tools/cli/install.html.
- Authentication with a token for the Databricks CLI: https://docs.databricks.com/en/dev-tools/cli/authentication.html.
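If you use token authentication, the CLI reads a connection profile from the .databrickscfg file in your home directory. As a rough sketch, a minimal profile looks like the following; the host shown is the workspace URL used later in this recipe, and the token value is a placeholder you replace with your own personal access token:
[DEFAULT]
host  = https://adb-7637940272361795.15.azuredatabricks.net
token = <your-personal-access-token>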
How to do it…
- Create a Databricks Asset Bundle manually: A Databricks Asset Bundle is a YAML file with the name databricks.yml. The file has three main sections:
  - The bundle section specifies the name of the bundle, which is dlt_dabs_cicd in this case.
  - The include section lists the source files that are part of the bundle. In this case, there is only one file: dlt_dabs_cicd_pipeline.yml. This is a YAML file that defines a Delta Live Tables pipeline.
  - The targets section defines the workspaces where the bundle can be deployed. There are two targets: dev and prod.
    - The dev target is used for development purposes. It has the following properties:
      - It uses the mode: development setting, which adds a prefix such as [dev my_user_name] to everything that is deployed to this target. This prevents clashes with other developers who are working on the same project. It also turns off any schedules and automatic triggers for jobs and turns on development mode for Delta Live Tables pipelines, so developers can run and test their code without impacting production data or resources.
      - It sets the default: true setting, which means that this is the default target for deploying the bundle. If no target is specified, the bundle will be deployed to the dev target.
      - It specifies the workspace property, which contains the host URL of the workspace where the bundle will be deployed. In this case, it is https://adb-7637940272361795.15.azuredatabricks.net.
    - The prod target is used for production deployment. It has the following properties:
      - It uses the mode: production setting, which means that everything deployed to this target does not get any prefix and uses the original names of the source files.
      - It overrides the workspace.root_path property, which is the default path where the bundle will be deployed in the workspace. By default, this is /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}, which is specific to the current user. In this case, it has been changed to /Shared/.bundle/prod/${bundle.name}, which is a shared path that is not specific to any user.
      - It specifies the same workspace host URL as the dev target, which means that both targets deploy to the same workspace.
      - It specifies the run_as property, which defines who will run the resources in the bundle. In this case, it uses the user_name: pulkit.chadha.packt@gmail.com setting, which means that everything will run as this user. Alternatively, a service principal could be specified with the service_principal_name setting.
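Putting these settings together, the databricks.yml file might look like the following sketch; it is reconstructed from the properties described above rather than copied verbatim from the recipe's repository:
# databricks.yml -- sketch reconstructed from the settings described in this step
bundle:
  name: dlt_dabs_cicd

include:
  - dlt_dabs_cicd_pipeline.yml

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-7637940272361795.15.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-7637940272361795.15.azuredatabricks.net
      root_path: /Shared/.bundle/prod/${bundle.name}
    run_as:
      user_name: pulkit.chadha.packt@gmail.com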
- Create resources within the Databricks Asset Bundle: We will define the bundle with only a single Delta Live Tables pipeline, shown in the YAML listing after this list. The dlt_dabs_cicd_pipeline.yml file has the following sections:
  - pipelines: This section defines one or more pipelines that are part of the resource. Each pipeline has a unique name, a target schema, and other settings. In this case, there is only one pipeline: dlt_dabs_cicd_pipeline.
  - target: This setting specifies the schema where the output tables of the pipeline will be written. The target schema can be different for different environments, such as development, testing, or production. In this case, the target schema is dlt_dabs_cicd_${bundle.environment}, where ${bundle.environment} is a variable that is replaced with the actual environment name.
  - continuous: This setting controls whether the pipeline runs in continuous mode. In continuous mode, the pipeline automatically updates whenever new data is available in the input sources. In this case, the pipeline runs in non-continuous mode, which means that it only updates when triggered manually or by a schedule.
  - channel: This setting specifies the channel of Delta Live Tables that the pipeline uses. A channel is a version of Delta Live Tables with different features and capabilities. In this case, the pipeline uses the CURRENT channel, which is the latest stable version of Delta Live Tables.
  - photon: This setting controls whether the pipeline uses Photon. Photon is a vectorized query engine that can speed up data processing and reduce resource consumption. In this case, the pipeline does not use Photon.
  - libraries: This section lists the source code libraries that are required for the pipeline. In this case, there is only one library, which is a notebook named 9.6 create-medallion-arch-DLT.sql.
  - clusters: This section defines the cluster that is used to run the pipeline. In this case, there is only one cluster, with a label of default.
  - autoscale: This setting enables or disables autoscaling for the cluster. In this case, autoscaling is enabled with a minimum of 1 worker and a maximum of 1 worker.
  - mode: This setting specifies the autoscaling mode. There are two modes: STANDARD and ENHANCED. ENHANCED mode can scale up and down faster and more efficiently than STANDARD mode. In this case, the mode is set to ENHANCED.
# The main pipeline for dlt_dabs_cicd
resources:
pipelines:
dlt_dabs_cicd_pipeline:
name: dlt_dabs_cicd_pipeline
target: dlt_dabs_cicd_${bundle.environment}
continuous: false
channel: CURRENT
photon: false
libraries:
- notebook:
path: 9.6 create-medallion-arch-DLT.sql
clusters:
- label: default
autoscale:
min_workers: 1
max_workers: 1
mode: ENHANCED
- Validate the Databricks Asset Bundle: Run the bundle validate command from the root directory of the bundle using the Databricks CLI. The validation is successful if you see a JSON representation of the bundle configuration. Otherwise, correct the errors and try again.
databricks bundle validate
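The validate command also accepts the -t (target) flag, so you can check the resolved configuration for a specific target before deploying; for example, to validate the prod target defined in databricks.yml:
databricks bundle validate -t prod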
- Deploy the bundle in the dev environment: Run the bundle deploy command with the Databricks CLI to deploy the bundle to a specific environment. Once run, the local notebooks and SQL code will be uploaded to your Databricks workspace in the cloud and the Delta Live Tables pipeline will be set up.
databricks bundle deploy -t dev
The output should look like this:
% databricks bundle deploy -t dev
Starting upload of bundle files
Uploaded bundle files at /Users/pulkit.chadha.packt@gmail.com/.bundle/dlt_dabs_cicd/dev/files!
Starting resource deployment
Resource deployment completed!
- See if the local notebook was moved: Go to Workspace in your Databricks workspace’s sidebar and navigate to Users | <your-username> | .bundle | <project-name> | dev | files | src. The notebook should be there:
Figure 9.18 – Databricks Bundle Working Directory
- See if the pipeline was set up: Go to Delta Live Tables in your Databricks workspace’s sidebar. Click on [dev <your-username>] <project-name>_pipeline on the Delta Live Tables tab.
Figure 9.19 – Databricks Asset Bundle dev pipeline
- Run the deployed bundle: In this step, you will run the Delta Live Tables pipeline in your workspace. Go to the bundle’s root directory and run the bundle run command with the Databricks CLI, using your own project name wherever <project-name> appears.
databricks bundle run -t dev
The output from the run command will list the pipeline events:
sh-3.2$ databricks bundle run -t dev
Update URL: https://adb-7637940272361795.15.azuredatabricks.net/#joblist/pipelines/1a3434c7-df77-41b9-b423-ff8bdcd002a2/updates/839e9f65-6cb3-4e46-bf1e-ef0339b50421
2023-10-23T21:24:07.574Z update_progress INFO "Update 839e9f is WAITING_FOR_RESOURCES."
2023-10-23T21:27:29.924Z update_progress INFO "Update 839e9f is INITIALIZING."
2023-10-23T21:27:44.805Z update_progress INFO "Update 839e9f is SETTING_UP_TABLES."
2023-10-23T21:28:01.566Z update_progress INFO "Update 839e9f is RUNNING."
2023-10-23T21:28:01.571Z flow_progress INFO "Flow 'device_data' is QUEUED."
....
2023-10-23T21:28:23.066Z flow_progress INFO "Flow 'user_metrics' is RUNNING."
2023-10-23T21:28:28.102Z flow_progress INFO "Flow 'user_metrics' has COMPLETED."
2023-10-23T21:28:29.987Z update_progress INFO "Update 839e9f is COMPLETED."
sh-3.2$
When the pipeline finishes successfully, you can view the details in your Databricks workspace:
Figure 9.20 – Delta Live Tables DAG
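Once the pipeline runs cleanly in dev, promoting it uses the same CLI workflow pointed at the prod target defined earlier; for example, the following commands would deploy and run the bundle in production:
databricks bundle deploy -t prod
databricks bundle run -t prod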
There’s more…
Within Databricks Asset Bundles, resources are a collection of entities that represent a data engineering or data science task. Each resource has a unique identifier and a set of properties that can be configured using the Databricks REST API. The YAML file has the following sections:
- resources: This section specifies the top-level mapping of resources and their default settings.
- experiments: This section defines one or more experiments that are part of the resource. An experiment is a collection of runs that are associated with a machine learning model. Each experiment has a unique identifier and a set of properties that can be configured using the Experiments API.
- jobs: This section defines one or more jobs that are part of the resource. A job is a scheduled or on-demand execution of a notebook, JAR, or Python script. Each job has a unique identifier and a set of properties that can be configured using the Jobs API.
- models: This section defines one or more models that are part of the resource. A model is a machine learning artifact that can be registered, versioned, packaged, and deployed. Each model has a unique identifier and a set of properties that can be configured using the Models API.
- pipelines: This section defines one or more pipelines that are part of the resource. A pipeline is a Delta Live Tables entity that consists of one or more datasets that are updated by applying transformations to the input data. Each pipeline has a unique identifier and a set of properties that can be configured using the Delta Live Tables API.
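As an illustration, a resources mapping that combines a job and a pipeline might look like the following sketch; the resource keys, notebook paths, and cluster values here are hypothetical placeholders, not part of this recipe:
resources:
  jobs:
    example_job:                    # hypothetical job key
      name: example_job
      tasks:
        - task_key: run_notebook
          notebook_task:
            notebook_path: ./src/example_notebook.py   # placeholder path
          new_cluster:
            spark_version: 13.3.x-scala2.12            # placeholder runtime version
            node_type_id: Standard_DS3_v2              # placeholder Azure node type
            num_workers: 1
  pipelines:
    example_pipeline:               # hypothetical pipeline key
      name: example_pipeline
      libraries:
        - notebook:
            path: ./src/example_dlt.sql                # placeholder path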
See also
- Databricks Asset Bundles documentation: https://docs.databricks.com/en/dev-tools/bundles/index.html
- Databricks Asset Bundles product tour: https://www.databricks.com/resources/demos/tours/data-engineering/databricks-asset-bundles
- Databricks Asset Bundles YAML Settings: https://docs.databricks.com/en/dev-tools/bundles/settings.html
- Databricks Asset Bundles session at DAIS 2023: https://www.databricks.com/dataaisummit/session/databricks-asset-bundles-standard-unified-approach-deploying-data-products/