This article is an excerpt from the book, Data Engineering with Databricks Cookbook, by Pulkit Chadha. This book shows you how to use Apache Spark, Delta Lake, and Databricks to build data pipelines, manage and transform data, optimize performance, and more. Additionally, you’ll implement DataOps and DevOps practices, and orchestrate data workflows.
Databricks Unity Catalog allows you to manage and access data in cloud object storage using a unified namespace and a consistent set of APIs. With Unity Catalog, you can do the following:
In this article, you will learn what Unity Catalog is and how it integrates with AWS S3.
Before you start setting up and configuring Unity Catalog, you need to have the following prerequisites:
In this section, we will first create a storage credential that uses an IAM role with access to an S3 bucket. Then, we will create an external location in Databricks Unity Catalog that uses the storage credential to access the S3 bucket.
You must create a storage credential to access data from an external location or a volume. In this example, you will create a storage credential that uses an IAM role to access the S3 bucket. The steps are as follows:
1. Go to Catalog Explorer: Click on Catalog in the left panel and go to Catalog Explorer.
2. Create storage credentials: Click on +Add and select Add a storage credential.
Figure 10.1 – Add a storage credential
3. Enter storage credential details: Enter a name for the credential and the ARN of the IAM role that allows Unity Catalog to access the storage location on your cloud tenant. Optionally, add a comment. Then, click on Create.
Figure 10.2 – Create a new storage credential
Important note
To learn more about IAM roles in AWS, you can reference the user guide here: https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html.
4. Get External ID: In the Storage credential created dialog, copy the External ID value and click on Done.
Figure 10.3 – External ID for the storage credential
5. Update the trust policy with an External ID: Update the trust policy associated with the IAM role and add the External ID value for sts:ExternalId:
Figure 10.4 – Updated trust policy with External ID
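For readers who prefer text to a screenshot, here is a minimal sketch of the general shape of a trust policy that carries an sts:ExternalId condition. The helper name `build_trust_policy` and both placeholder values are assumptions for illustration. Keep the principal that your role's existing trust policy already lists, and use the External ID you copied in step 4.

```python
import json


def build_trust_policy(principal_arn: str, external_id: str) -> dict:
    """Return an IAM trust policy document that lets `principal_arn` assume
    the role only when it supplies `external_id` as sts:ExternalId."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
                "Condition": {
                    # The External ID copied from the "Storage credential created" dialog
                    "StringEquals": {"sts:ExternalId": external_id}
                },
            }
        ],
    }


# Placeholder values (assumptions): substitute the principal from your role's
# existing trust policy and the External ID shown by Databricks.
policy = build_trust_policy(
    principal_arn="arn:aws:iam::<DATABRICKS-ACCOUNT-ID>:role/<UNITY-CATALOG-ROLE>",
    external_id="<EXTERNAL-ID-FROM-STEP-4>",
)

# Print the JSON document so it can be pasted into the IAM console
print(json.dumps(policy, indent=2))
```

The printed JSON can be compared against the trust policy shown in the IAM console. The part this step changes is the `sts:ExternalId` condition value.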
An external location contains a reference to a storage credential and a cloud storage path. You need to create an external location to access data from a custom storage location that Unity Catalog uses to reference external tables. In this example, you will create an external location that points to the de-book-ext-loc folder in an S3 bucket. To create an external location, you can follow these steps:
1. Go to Catalog Explorer: Click on Catalog in the left panel to go to Catalog Explorer.
2. Create external location: Click on +Add and select Add an external location:
Figure 10.5 – Add an external location
3. Pick an external location creation method: Select Manual and then click on Next:
Figure 10.6 – Create a new external location
4. Enter external location details: Enter the external location name, select the storage credential, and enter the S3 URL; then, click on the Create button:
Figure 10.7 – Create a new external location manually
5. Test connection: Test the connection to make sure you have set up the credentials correctly and that Unity Catalog can access the cloud storage:
Figure 10.8 – Test connection for external location
If everything is set up right, you should see a screen like the following. Click on Done:
Figure 10.9 – Test connection results
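The UI steps above also have a SQL equivalent, the `CREATE EXTERNAL LOCATION` statement. The sketch below only builds that statement as a string. The function name, the location name `de_book_ext_loc`, and the bucket and credential placeholders are assumptions for illustration, not values from the book.

```python
from typing import Optional


def _quote_identifier(name: str) -> str:
    """Wrap an identifier in backticks; reject embedded backticks to keep the sketch simple."""
    if "`" in name:
        raise ValueError(f"identifier must not contain a backtick: {name!r}")
    return f"`{name}`"


def _quote_string(value: str) -> str:
    """Wrap a value in single quotes; reject characters that would need escaping."""
    if "'" in value or "\\" in value:
        raise ValueError(f"value must not contain quotes or backslashes: {value!r}")
    return f"'{value}'"


def create_external_location_sql(
    name: str,
    url: str,
    storage_credential: str,
    comment: Optional[str] = None,
) -> str:
    """Build a CREATE EXTERNAL LOCATION statement for an S3 path."""
    if not url.startswith("s3://"):
        raise ValueError("expected an s3:// URL")
    statement = (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {_quote_identifier(name)} "
        f"URL {_quote_string(url)} "
        f"WITH (STORAGE CREDENTIAL {_quote_identifier(storage_credential)})"
    )
    if comment:
        statement += f" COMMENT {_quote_string(comment)}"
    return statement


# Hypothetical names: replace the bucket and credential with your own values.
statement = create_external_location_sql(
    name="de_book_ext_loc",
    url="s3://<your-bucket>/de-book-ext-loc",
    storage_credential="<your-storage-credential>",
)
print(statement)
```

In a Databricks notebook, a statement like this could be passed to `spark.sql(statement)`, assuming the storage credential already exists and you hold the privileges needed to create external locations.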
In summary, connecting to cloud object storage using Databricks Unity Catalog provides a streamlined approach to managing and accessing data across cloud platforms such as AWS S3, Azure Blob Storage, and Google Cloud Storage. With a unified namespace, consistent APIs, and built-in governance features, Unity Catalog simplifies the process of creating and managing storage credentials and external locations. Its fine-grained access controls let you securely manage data stored in different formats and cloud environments while using Databricks' data analytics capabilities. This article walked through creating a storage credential that uses an IAM role and an external location that points to an S3 bucket, the two pieces needed to connect AWS cloud storage with Unity Catalog.
Pulkit Chadha is a seasoned technologist with over 15 years of experience in data engineering. His proficiency in crafting and refining data pipelines has been instrumental in driving success across diverse sectors such as healthcare, media and entertainment, hi-tech, and manufacturing. Pulkit’s tailored data engineering solutions are designed to address the unique challenges and aspirations of each enterprise he collaborates with.