Connecting Cloud Object Storage with Databricks Unity Catalog

This article is an excerpt from the book, Data Engineering with Databricks Cookbook, by Pulkit Chadha. This book shows you how to use Apache Spark, Delta Lake, and Databricks to build data pipelines, manage and transform data, optimize performance, and more. Additionally, you’ll implement DataOps and DevOps practices, and orchestrate data workflows.

Introduction

Databricks Unity Catalog allows you to manage and access data in cloud object storage using a unified namespace and a consistent set of APIs. With Unity Catalog, you can do the following: 

  • Create and manage storage credentials, external locations, storage locations, and volumes using SQL commands or the Unity Catalog UI 
  • Access data from various cloud platforms (AWS S3, Azure Blob Storage, or Google Cloud Storage) and storage formats (Parquet, Delta Lake, CSV, or JSON) using the same SQL syntax or Spark APIs 
  • Apply fine-grained access control and data governance policies to your data using Databricks SQL Analytics or Databricks Runtime 

In this article, you will learn what Unity Catalog is and how it integrates with AWS S3. 
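For example, once a storage credential and an external location are in place (as you will set up later in this article), the same SQL works regardless of file format, and access can be governed at the storage path itself. The bucket path, folder, and principal names below are illustrative placeholders only:

```sql
-- Query files directly at a governed cloud path; only the format prefix changes.
SELECT * FROM delta.`s3://my-bucket/de-book-ext-loc/sales`;
SELECT * FROM parquet.`s3://my-bucket/de-book-ext-loc/raw_events`;

-- Apply fine-grained access control to the storage path itself.
GRANT READ FILES ON EXTERNAL LOCATION my_external_location TO `data_analysts`;
```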

Getting ready 

Before you start setting up and configuring Unity Catalog, you need to have the following prerequisites: 

  • A Databricks workspace with administrator privileges 
  • A Databricks workspace with the Unity Catalog feature enabled 
  • A cloud storage account (such as AWS S3, Azure Blob Storage, or Google Cloud Storage) with the necessary permissions to read and write data 

How to do it… 

In this section, we will first create a storage credential backed by an IAM role that has access to an S3 bucket. Then, we will create an external location in Databricks Unity Catalog that uses the storage credential to access the S3 bucket. 

Creating a storage credential 

You must create a storage credential to access data from an external location or a volume. In this example, you will create a storage credential that uses an IAM role to access the S3 bucket. The steps are as follows: 

1. Go to Catalog Explorer: Click on Catalog in the left panel and go to Catalog Explorer

2. Create storage credentials: Click on +Add and select Add a storage credential


Figure 10.1 – Add a storage credential 

3. Enter storage credential details: Enter a name for the credential and the ARN of the IAM role that allows Unity Catalog to access the storage location on your cloud tenant, add a comment if you want, and then click on Create


Figure 10.2 – Create a new storage credential 

Important note 

To learn more about IAM roles in AWS, you can reference the user guide here: https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html. 

4. Get External ID: In the Storage credential created dialog, copy the External ID value and click on Done


Figure 10.3 – External ID for the storage credential 
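If you want to confirm the credential from a notebook or the SQL editor instead of the UI, commands along the following lines should work; the credential name de_book_credential is an assumed example:

```sql
-- List the storage credentials visible to you in this metastore.
SHOW STORAGE CREDENTIALS;

-- Inspect a single credential (the name here is an assumed example).
DESCRIBE STORAGE CREDENTIAL de_book_credential;
```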

5. Update the trust policy with an External ID: Update the trust policy associated with the IAM role and add the External ID value for sts:ExternalId


Figure 10.4 – Updated trust policy with External ID 
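The updated trust policy in Figure 10.4 typically takes a shape like the following JSON sketch. The principal ARN and External ID are placeholders: use the Databricks-provided Unity Catalog principal ARN from the official documentation for your deployment, and the External ID you copied in step 4.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<DATABRICKS_UNITY_CATALOG_PRINCIPAL_ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<EXTERNAL_ID_FROM_STEP_4>"
        }
      }
    }
  ]
}
```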

Creating an external location 

An external location contains a reference to a storage credential and a cloud storage path. You need to create an external location before Unity Catalog can access data at a custom storage path, for example when referencing external tables. In this example, you will create an external location that points to the de-book-ext-loc folder in an S3 bucket. To create an external location, follow these steps: 

1. Go to Catalog Explorer: Click on Catalog in the left panel to go to Catalog Explorer

2. Create external location: Click on +Add and select Add an external location


Figure 10.5 – Add an external location 

3. Pick an external location creation method: Select Manual and then click on Next


Figure 10.6 – Create a new external location 

4. Enter external location details: Enter the external location name, select the storage credential, and enter the S3 URL; then, click on the Create button: 


Figure 10.7 – Create a new external location manually 
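If you prefer SQL over the manual UI flow, an equivalent external location can be created with a statement along these lines. The bucket and credential names are assumed placeholders; de-book-ext-loc matches the folder used in this example:

```sql
-- Bind the S3 path to the storage credential created earlier.
CREATE EXTERNAL LOCATION IF NOT EXISTS de_book_ext_loc
URL 's3://my-bucket/de-book-ext-loc'
WITH (STORAGE CREDENTIAL de_book_credential)
COMMENT 'External location for the data engineering examples';
```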

5. Test connection: Test the connection to make sure you have set up the credentials accurately and that Unity Catalog is able to access cloud storage: 


Figure 10.8 – Test connection for external location 

If everything is set up right, you should see a screen like the following. Click on Done


Figure 10.9 – Test connection results 
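You can also sanity-check the new external location from SQL. The following sketch uses the assumed names from above; listing files requires the READ FILES privilege on the location:

```sql
-- Confirm the external location exists and points at the expected URL and credential.
DESCRIBE EXTERNAL LOCATION de_book_ext_loc;

-- List the files Unity Catalog can see at the path.
LIST 's3://my-bucket/de-book-ext-loc';

-- Grant other principals access to the location as needed.
GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION de_book_ext_loc TO `data_engineers`;
```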

Conclusion 

In summary, connecting to cloud object storage using Databricks Unity Catalog provides a streamlined approach to managing and accessing data across various cloud platforms such as AWS S3, Azure Blob Storage, and Google Cloud Storage. By utilizing a unified namespace, consistent APIs, and powerful governance features, Unity Catalog simplifies the process of creating and managing storage credentials and external locations. With built-in fine-grained access controls, you can securely manage data stored in different formats and cloud environments, all while leveraging Databricks' powerful data analytics capabilities. This guide walks through setting up an IAM role and creating an external location in AWS S3, demonstrating how easy it is to connect cloud storage with Unity Catalog. 

Author Bio

Pulkit Chadha is a seasoned technologist with over 15 years of experience in data engineering. His proficiency in crafting and refining data pipelines has been instrumental in driving success across diverse sectors such as healthcare, media and entertainment, hi-tech, and manufacturing. Pulkit’s tailored data engineering solutions are designed to address the unique challenges and aspirations of each enterprise he collaborates with.