Deploying Delta Live Tables pipelines with Databricks Asset Bundles

Databricks Asset Bundles is a new feature that allows you to deploy your data, analytics, and machine learning projects as a collection of source files. You can use a YAML file to specify the resources and settings for your project, such as jobs, pipelines, endpoints, experiments, and models. This way, you can easily manage your code and infrastructure in a consistent and automated way.
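
For instance, a bare-bones databricks.yml might look something like the following sketch; the names here are purely illustrative and are not taken from this recipe:

    # databricks.yml – bare-bones, illustrative bundle definition (hypothetical names)
    bundle:
      name: my_project                # name of the bundle

    resources:
      pipelines:
        my_pipeline:                  # a Delta Live Tables pipeline managed by this bundle
          name: my_pipeline
          libraries:
            - notebook:
                path: ./src/my_notebook.sql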

The following are some of the benefits of using Databricks Asset Bundles:

  • You can use best practice tools and processes to work with source code, such as source control, code review, testing, and CI/CD
  • You can streamline your local development with IDEs and run your resources before deploying them to production
  • You can configure your deployments across multiple workspaces, regions, and clouds

In this recipe, you will learn how to package your pipeline as a Databricks Asset Bundle and deploy it to a production workspace.

Getting ready

Before you start, you will need the following:

  • A Databricks workspace that you can deploy to
  • The Databricks CLI, installed and configured to authenticate to that workspace, since the bundle commands in this recipe are run from it
  • The Delta Live Tables notebook used earlier in this chapter (9.6 create-medallion-arch-DLT.sql), which the pipeline in this bundle executes

How to do it…

  1. Create a Databricks Asset Bundle manually: A Databricks Asset Bundle is defined by a YAML file named databricks.yml. The file has three main sections, and a sketch of the assembled file follows this list:
    • The bundle section specifies the name of the bundle, which is dlt_dabs_cicd in this case.
    • The include section lists the source files that are part of the bundle. In this case, there is only one file: dlt_dabs_cicd_pipeline.yml. This is a YAML file that defines a Delta Live Tables pipeline.
    • The targets section defines the workspaces where the bundle can be deployed. There are two targets: dev and prod.
      • The dev target is used for development purposes. It has the following properties:
        • It uses the mode: development setting, which adds a prefix such as [dev my_user_name] to everything that is deployed to this target. This prevents clashes with other developers who are working on the same project.
        • It also turns off any schedules and automatic triggers for jobs and turns on development mode for Delta Live Tables pipelines. This lets developers execute and test their code without impacting production data or resources.
        • It sets the default: true setting, which means that this is the default target for deploying the bundle. If no target is specified, the bundle will be deployed to the dev target.
        • It specifies the workspace property, which contains the host URL of the workspace where the bundle will be deployed. In this case, it is https://adb-7637940272361795.15.azuredatabricks.net.
      • The prod target is used for production deployment. It has the following properties:
        • It uses the mode: production setting, which means that everything deployed to this target does not get any prefix and uses the original names of the source files.
        • It overrides the workspace.root_path property, which is the default path where the bundle will be deployed in the workspace. By default, it is /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}, which is specific to the current user. In this case, it has been changed to /Shared/.bundle/prod/${bundle.name}, which is a shared path that is not specific to any user.
        • It specifies the same workspace property as the dev target, which means that both targets use the same workspace host URL.
        • It specifies the run_as property, which defines who will run the resources in the bundle. In this case, it uses the user_name: pulkit.chadha.packt@gmail.com setting, which means that everything will run as this user. Alternatively, a service principal name could be used here using the service_principal_name setting.
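    Putting these settings together, the databricks.yml for this recipe might look roughly like the following sketch (the bundle name, host URL, root path, and user name come from the settings described above; treat the exact layout as an approximation rather than the book's verbatim file):

      # databricks.yml – sketch assembled from the settings described above
      bundle:
        name: dlt_dabs_cicd

      include:
        - dlt_dabs_cicd_pipeline.yml   # pipeline definition; adjust the path if it sits in a subfolder

      targets:
        dev:
          mode: development
          default: true
          workspace:
            host: https://adb-7637940272361795.15.azuredatabricks.net

        prod:
          mode: production
          workspace:
            host: https://adb-7637940272361795.15.azuredatabricks.net
            root_path: /Shared/.bundle/prod/${bundle.name}
          run_as:
            user_name: pulkit.chadha.packt@gmail.com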
  2. Create resources within the Databricks Asset Bundle: We will only define the bundle with a single Delta Live Tables pipeline. The YAML file has the following sections:
    • pipelines: This section defines one or more pipelines that are part of the resource. Each pipeline has a unique name, a target schema, and other settings. In this case, there is only one pipeline: dlt_dabs_cicd_pipeline.
    • target: This setting specifies the schema where the output tables of the pipeline will be written. The target schema can be different for different environments, such as development, testing, or production. In this case, the target schema is dlt_dabs_cicd_${bundle.environment}, where ${bundle.environment} is a variable that can be replaced with the actual environment name.
    • continuous: This setting controls whether the pipeline runs in continuous mode or not. Continuous mode means that the pipeline will automatically update whenever there is new data available in the input sources. In this case, the pipeline runs in non-continuous mode, which means that it will only update when triggered manually or by a schedule.
    • channel: This setting specifies the channel of Delta Live Tables that the pipeline uses. A channel is a version of Delta Live Tables that has different features and capabilities. In this case, the pipeline uses the CURRENT channel, which is the latest and most stable version of Delta Live Tables.
    • photon: This setting controls whether the pipeline uses Photon or not. Photon is a vectorized query engine that can speed up data processing and reduce resource consumption. In this case, the pipeline does not use Photon.
    • libraries: This section lists the source code libraries that are required for the pipeline. In this case, there is only one library, which is a notebook named 9.6 create-medallion-arch-DLT.sql.
    • clusters: This section defines the cluster that is used to run the pipeline. In this case, there is only one cluster with a label of default.
    • autoscale: This setting enables or disables autoscaling for the cluster. In this case, autoscaling is enabled with a minimum of one worker and a maximum of one worker.
    • mode: This setting specifies the mode of autoscaling. There are two modes: STANDARD and ENHANCED. ENHANCED mode can scale up and down faster and more efficiently than STANDARD mode. In this case, the mode is set to ENHANCED.
      # The main pipeline for dlt_dabs_cicd
      resources:
        pipelines:
          dlt_dabs_cicd_pipeline:
            name: dlt_dabs_cicd_pipeline
            target: dlt_dabs_cicd_${bundle.environment}
            continuous: false
            channel: CURRENT
            photon: false
            libraries:
              - notebook:
                  path: 9.6 create-medallion-arch-DLT.sql
            clusters:
              - label: default
                autoscale:
                  min_workers: 1
                  max_workers: 1
                  mode: ENHANCED
  3. Validate the Databricks Asset Bundle: Run the bundle validate command from the root directory of the bundle using the Databricks CLI. The validation is successful if you see a JSON representation of the bundle configuration. Otherwise, correct the errors and try again.
    databricks bundle validate
  4. Deploy the bundle in the dev environment: Run the bundle deploy command with the Databricks CLI to deploy the bundle in a specific environment. Once run, the local notebooks and SQL code will be moved to your Databricks workspace on the cloud and the Delta Live Tables pipeline will be set up.
    databricks bundle deploy -t dev

    The output should look like this:

    % databricks bundle deploy -t dev
    Starting upload of bundle files
    Uploaded bundle files at /Users/pulkit.chadha.packt@gmail.com/.bundle/dlt_dabs_cicd/dev/files!
    Starting resource deployment
    Resource deployment completed!
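
    Because the bundle also defines a prod target, the same command can later be used to promote the pipeline by switching the target flag (shown here only as an illustration; this recipe continues with the dev target):

    databricks bundle deploy -t prod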
  5. See if the local notebook was uploaded: Go to Workspace in your Databricks workspace’s sidebar and navigate to Users | <your-username> | .bundle | <project-name> | dev | files | src. The notebook should be there:
Figure 9.18 – Databricks Bundle Working Directory

  6. See if the pipeline was set up: Go to Delta Live Tables in your Databricks workspace’s sidebar. Click on [dev <your-username>] <project-name>_pipeline on the Delta Live Tables tab.
Figure 9.19 – Databricks Asset Bundle dev pipeline

  7. Run the deployed bundle: You will run the Delta Live Tables pipeline in your workspace in this step. Go to the root directory and run the bundle run command with the Databricks CLI. Use your project name instead of <project-name>.
    databricks bundle run -t dev

    The output from the run command will list the pipeline events:

    sh-3.2$ databricks bundle run -t dev
    Update URL: https://adb-7637940272361795.15.azuredatabricks.net/#joblist/pipelines/1a3434c7-df77-41b9-b423-ff8bdcd002a2/updates/839e9f65-6cb3-4e46-bf1e-ef0339b50421
    2023-10-23T21:24:07.574Z update_progress INFO "Update 839e9f is WAITING_FOR_RESOURCES."
    2023-10-23T21:27:29.924Z update_progress INFO "Update 839e9f is INITIALIZING."
    2023-10-23T21:27:44.805Z update_progress INFO "Update 839e9f is SETTING_UP_TABLES."
    2023-10-23T21:28:01.566Z update_progress INFO "Update 839e9f is RUNNING."
    2023-10-23T21:28:01.571Z flow_progress   INFO "Flow 'device_data' is QUEUED."
    ....
    2023-10-23T21:28:23.066Z flow_progress   INFO "Flow 'user_metrics' is RUNNING."
    2023-10-23T21:28:28.102Z flow_progress   INFO "Flow 'user_metrics' has COMPLETED."
    2023-10-23T21:28:29.987Z update_progress INFO "Update 839e9f is COMPLETED."
    sh-3.2$

    When the pipeline finishes successfully, you can view the details in your Databricks workspace:

Figure 9.20 – Delta Live Tables DAG

There’s more…

Within Databricks Asset Bundles, resources are a collection of entities that represent a data engineering or data science task. Each resource has a unique identifier and a set of properties that can be configured using the Databricks REST API. The YAML file has the following sections (an illustrative sketch follows this list):

  • resources: This section specifies the top-level mapping of resources and their default settings.
  • experiments: This section defines one or more experiments that are part of the resource. An experiment is a collection of runs that are associated with a machine learning model. Each experiment has a unique identifier and a set of properties that can be configured using the Experiments API.
  • jobs: This section defines one or more jobs that are part of the resource. A job is a scheduled or on-demand execution of a notebook, JAR, or Python script. Each job has a unique identifier and a set of properties that can be configured using the Jobs API.
  • models: This section defines one or more models that are part of the resource. A model is a machine learning artifact that can be registered, versioned, packaged, and deployed. Each model has a unique identifier and a set of properties that can be configured using the Models API.
  • pipelines: This section defines one or more pipelines that are part of the resource. A pipeline is a Delta Live Tables entity that consists of one or more datasets that are updated by applying transformations to the input data. Each pipeline has a unique identifier and a set of properties that can be configured using the Delta Live Tables API.
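
As an illustration, a resource definition that combined several of these mappings might be organized as follows; the job and experiment entries are hypothetical, and only the pipeline corresponds to this recipe:

    # Illustrative resources mapping (job and experiment entries are hypothetical)
    resources:
      jobs:
        nightly_refresh_job:                  # hypothetical job that triggers the pipeline
          name: nightly_refresh_job
          tasks:
            - task_key: refresh
              pipeline_task:
                pipeline_id: ${resources.pipelines.dlt_dabs_cicd_pipeline.id}
      pipelines:
        dlt_dabs_cicd_pipeline:               # the pipeline defined in this recipe
          name: dlt_dabs_cicd_pipeline
          target: dlt_dabs_cicd_${bundle.environment}
      experiments:
        my_experiment:                        # hypothetical MLflow experiment
          name: /Shared/my_experiment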

See also
