Azure Databricks Cookbook

Chapter 2: Reading and Writing Data from and to Various Azure Services and File Formats

Azure Databricks provides options for data engineers, data scientists, and data analysts to read and write data from and to various sources such as different file formats, relational databases, NoSQL databases, Azure Storage, and so on. This gives users a lot of flexibility in ingesting and storing data in various formats as per business requirements. It also provides libraries to ingest data from streaming systems such as Azure Event Hubs and Kafka.

In this chapter, we will learn how we can read data from different file formats, such as comma-separated values (CSV), Parquet, and JavaScript Object Notation (JSON), and how to use native connectors to read and write data from and to Azure SQL Database and Azure Synapse Analytics. We will also learn how to read and store data in Azure Cosmos DB.

By the end of this chapter, you will have built a foundation for reading data from various sources that are required to work on the end-to-end (E2E) scenarios of a data ingestion pipeline. You will learn how and when to use Java Database Connectivity (JDBC) drivers and Apache Spark connectors to ingest data into Azure SQL Database.

We're going to cover the following recipes in this chapter:

Mounting Azure Data Lake Storage Gen2 (ADLS Gen2) and Azure Blob storage to Azure Databricks File System (DBFS)
Reading and writing data from and to Azure Blob storage
Reading and writing data from and to ADLS Gen2
Reading and writing data from and to an Azure SQL database using native connectors
Reading and writing data from and to Azure Synapse Dedicated Structured Query Language (SQL) Pool using native connectors
Reading and writing data from and to the Azure Cosmos DB
Reading and writing data from and to CSV and Parquet
Reading and writing data from and to JSON, including nested JSON

Mounting ADLS Gen2 and Azure Blob storage to Azure DBFS

Azure Databricks uses DBFS, which is a distributed file system that is mounted into an Azure Databricks workspace and that can be made available on Azure Databricks clusters. DBFS is an abstraction that is built on top of Azure Blob storage and ADLS Gen2. It mainly offers the following benefits:

It allows you to mount the Azure Blob and ADLS Gen2 storage objects so that you can access files and folders without requiring any storage credentials.
You can read files directly from the mount point without needing to provide a full storage Uniform Resource Locator (URL).
You can create folders and write files directly to the mount point.
Data written to the mount point gets persisted after a cluster is terminated.

By the end of this recipe, you will have learned how to mount Azure Blob and ADLS Gen2 storage to Azure DBFS. You will learn how to access files and folders in Blob storage and ADLS Gen2 by doing the following:

Directly accessing the storage URL
Mounting the storage account to DBFS
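
The two access styles differ mainly in the path that is handed to Spark. The following is a minimal sketch of both path forms, assuming the storage account and container names used in this recipe and the /mnt/Gen2 mount point created later in it; the helper names direct_url and mounted_path are hypothetical and are not part of the cookbook notebooks:

```python
def direct_url(container, account, path="", gen2=True):
    """Build a full storage URL: abfss for ADLS Gen2, wasbs for Azure Blob storage."""
    scheme, suffix = ("abfss", "dfs") if gen2 else ("wasbs", "blob")
    base = f"{scheme}://{container}@{account}.{suffix}.core.windows.net"
    return f"{base}/{path.lstrip('/')}" if path else base


def mounted_path(mount_point, path=""):
    """Build the DBFS path of a file that sits under a mount point."""
    base = "dbfs:" + mount_point.rstrip("/")
    return f"{base}/{path.lstrip('/')}" if path else base


# Direct access needs the full URL (and credentials set in the session)
print(direct_url("rawdata", "cookbookadlsgen2storage", "Orders.csv"))
print(direct_url("rawdata", "cookbookblobstorage", "Orders.csv", gen2=False))
# Access through a mount point only needs the DBFS path
print(mounted_path("/mnt/Gen2", "Orders.csv"))
```

Either string can be passed to spark.read...load(); the rest of this recipe shows how to set up the credentials that each form requires.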

Getting ready

Create ADLS Gen2 and Azure Blob storage resources by following the links provided in the Technical requirements section. In this recipe, the names of the storage resources we are using will be the following:

cookbookadlsgen2storage for ADLS Gen2 storage
cookbookblobstorage for Azure Blob storage

You can see the Storage Accounts we created in the following screenshot:

Figure 2.1 – Storage accounts created in the CookbookRG resource group

Before you get started, you will need to create a service principal that will be used to mount the ADLS Gen2 account to DBFS. Here are the steps that need to be followed to create a service principal from the Azure portal:

Application registration: You will need to register an Azure Active Directory (AAD) application. On the Azure portal home page, search for Azure Active Directory and select it. On the Azure Active Directory page, in the left pane, select App registrations and click on New registration:

Figure 2.2 – New application registration page
On the Register an application page, give any name to the application you are creating, leave the other options at their default values, and click Register:
Figure 2.3 – New application registration page (continued)
Once an application is created, you will see it listed on the App registrations page in AAD, as seen in the following screenshot:
Figure 2.4 – New application created
Select the new application you have created and get the application identifier (ID), and the tenant ID for the application that will be used for mounting the ADLS Gen2 account to DBFS:
Figure 2.5 – Getting application ID and tenant ID
To create a secret, click on Certificates & secrets under the Manage heading and click on the + New client secret option listed under Client secrets. You can provide any description for the secret and provide expiry as 1 year for this exercise:
Figure 2.6 – Adding client secret for the application
As soon as you create a secret, ensure you copy the value of the secret, else you cannot get the value of the existing secret later. You will have to create a new secret if the secret value is not copied immediately after it is created:
Figure 2.7 – Client secret value page
You now have an application ID, a tenant ID, and a secret—these are required to mount an ADLS Gen2 account to DBFS.

Once the application is created, we need to grant the Storage Blob Data Contributor role to ADLSGen2App on the ADLS Gen2 storage account. The following steps demonstrate how to provide access to the ADLS Gen2 storage account:

From the Azure portal home page, go to the CookbookRG resource group and select the cookbookadlsgen2storage (ADLS Gen2 storage) account you have created. Click Access Control (IAM), then click on + Add, and select the Add role assignment option. On the Add role assignment blade, assign the Storage Blob Data Contributor role to our service principal (that is, ADLSGen2App):
Figure 2.8 – Adding permissions to ADLS Gen2 for service principal
Under Add role assignment, select a role and access for ADLSGen2App, as shown in the following screenshot, and click on the Save button:
Figure 2.9 – Adding permissions to ADLS Gen2 for service principal
We require a storage key so that we can mount the Azure Blob storage account to DBFS. The following steps show how to get a storage key for the Azure Blob storage account (cookbookblobstorage) we have already created.
From the Azure portal home page, go to the CookbookRG resource group and select the cookbookblobstorage (Azure Blob storage) account you have created. Click on Access keys under Settings and click on the Show keys button. The value you see for the key1 key is the storage key we will use to mount the Azure Blob storage account to DBFS:

Figure 2.10 – Azure Blob storage account access key
Copy the value of key1, which you will see when you click on Show keys. The process of getting a storage key is the same for an Azure Blob storage account and an ADLS Gen2 storage account.
You can find the notebooks that we will be using to mount Azure Blob storage and ADLS Gen2 in the Chapter02 folder of your local cloned Git repository.
After you import the following two notebooks, you can follow along with the code in the two notebooks for this recipe:
(a) 2-1.1.Mounting ADLS Gen-2 Storage FileSystem to DBFS.ipynb
(b) 2-1.2.Mounting Azure Blob Storage Container to DBFS.ipynb
Create a container named rawdata in both the cookbookadlsgen2storage and cookbookblobstorage accounts you have already created, and upload the Orders.csv file, which you will find in the Chapter02 folder of your cloned Git repository.
Note
We have tested the steps mentioned in this recipe on Azure Databricks Runtime version 6.4 which includes Spark 2.4.5 and on Runtime version 7.3 LTS which includes Spark 3.0.1.

How to do it…

The following steps show how to mount an ADLS Gen2 storage account to DBFS and view the files and folders in the rawdata folder:

Launch a Databricks workspace, open the 2-1.1.Mounting ADLS Gen-2 Storage FileSystem to DBFS.ipynb notebook, and execute the first cell in the notebook, which contains the code shown next. Follow the steps mentioned in the Getting ready section to get the application ID, tenant ID, and secret, and replace the values for the variables used in the following code snippet for clientID, tenantID, and clientSecret:

#ClientId, TenantId and Secret are for the application (ADLSGen2App) we have created as part of this recipe
clientID = "XXXXXb3dd-4f6e-4XXXX-b6fa-aXXXXXXX00db"
tenantID = "xxx-xxx-XXXc-xx-eXXXXXXXXXX"
clientSecret = "xxxxxx-xxxxxxxxxx-XXXXXX"
oauth2Endpoint = "https://login.microsoftonline.com/{}/oauth2/token".format(tenantID)
#Storage account, mount point, and URL of the rawdata file system used in this recipe
storageAccount = "cookbookadlsgen2storage"
mountpoint = "/mnt/Gen2"
storageEndPoint = "abfss://rawdata@{}.dfs.core.windows.net/".format(storageAccount)
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": clientID,
           "fs.azure.account.oauth2.client.secret": clientSecret,
           "fs.azure.account.oauth2.client.endpoint": oauth2Endpoint}
try:
  dbutils.fs.mount(
    source = storageEndPoint,
    mount_point = mountpoint,
    extra_configs = configs)
except Exception as e:
  #Mounting an existing mount point raises an error; print it so other failures are not hidden
  print("Already mounted or mount failed...."+mountpoint, e)

After the preceding steps are executed, the ADLS Gen2 storage account will be mounted to /mnt/Gen2 in DBFS. We can check the folders and files in the storage account by executing the following code:
```
%fs ls /mnt/Gen2
```
You can also check the files and folders using the dbutils command, as shown in the following code snippet:
```
display(dbutils.fs.ls("/mnt/Gen2"))
```
Upon executing the preceding command, you should see all the folders and files you have created in the storage account.
To ensure we can read the Orders.csv file from the mounted path, we will execute the following code:
```
df_ord= spark.read.format("csv").option("header",True).load("dbfs:/mnt/Gen2/Orders.csv")
```
The following code will display the DataFrame's contents:
```
display(df_ord)
```
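
The try/except block in the mount step treats every failure as "already mounted". As an alternative, the existing mount points can be checked explicitly before mounting. The following is a minimal sketch, assuming it runs in a Databricks notebook where dbutils is available; the helper name mount_if_absent is hypothetical and is not part of the cookbook notebooks:

```python
def mount_if_absent(dbutils, source, mount_point, extra_configs):
    """Mount `source` at `mount_point` unless that mount point already exists.

    Returns True if a new mount was created and False if it was already there.
    """
    # dbutils.fs.mounts() lists the current mounts; each entry exposes mountPoint
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs,
    )
    return True
```

In the notebook for this recipe, the call would look like mount_if_absent(dbutils, storageEndPoint, mountpoint, configs). With this approach, a genuine failure such as a wrong secret surfaces as an error instead of being reported as an existing mount.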

Up to now, we have learned how to mount ADLS Gen2 to DBFS. Now, the following steps show us how to mount an Azure Blob storage account to DBFS and list all files and folders created in the Blob storage account:

Launch a Databricks workspace, open the 2-1.2.Mounting Azure Blob Storage Container to DBFS.ipynb notebook, and execute the first cell in the notebook, which contains the following code:

#You will get the storage account name and key from the portal, as shown in the Getting ready section.
storageAccount="cookbookblobstorage"
storageKey ="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=="
mountpoint = "/mnt/Blob"
storageEndpoint = "wasbs://rawdata@{}.blob.core.windows.net".format(storageAccount)
storageConnSting = "fs.azure.account.key.{}.blob.core.windows.net".format(storageAccount)
try:
  dbutils.fs.mount(
    source = storageEndpoint,
    mount_point = mountpoint,
    extra_configs = {storageConnSting:storageKey})
except Exception as e:
  #Mounting an existing mount point raises an error; print it so other failures are not hidden
  print("Already mounted or mount failed...."+mountpoint, e)

After the preceding steps are executed, the Azure Blob storage account will be mounted to /mnt/Blob in DBFS. We can check the folders and files available in the storage account by executing the following code:
```
%fs ls /mnt/Blob
```
You can also check the files and folders using dbutils, as shown in the following code snippet:
```
display(dbutils.fs.ls("/mnt/Blob"))
```
You should see all the folders and files you have created in the storage account as the output of the preceding code.

Run the following code to read the CSV file from the mount point:

df_ord= spark.read.format("csv").option("header",True).load("dbfs:/mnt/Blob/Orders.csv")

The following code will display the DataFrame's contents:
```
display(df_ord.limit(10))
```

The preceding code will display 10 records from the DataFrame.

How it works…

The preferred way of accessing an ADLS Gen2 storage account is by mounting the storage account file system using a service principal and Open Authorization 2.0 (OAuth 2.0). There are other ways of accessing a storage account from a Databricks notebook. These are listed here:

Using a service principal directly without mounting the file system
Using a storage key to access the Gen2 storage account directly without mounting
Using a shared access signature (SAS) token

We will learn about the preceding options in the next recipes. For Azure Blob storage, you have learned how to mount a storage account by using a storage key, but there are other options as well to access an Azure Blob storage account from a Databricks notebook. These are listed here:

Using the Spark DataFrame application programming interface (API). We will learn about this option in the next recipe.
Using a Resilient Distributed Dataset (RDD) API. We will not talk about this option as all our examples are using DataFrames, which is the preferred method for loading data in Databricks.

To view files in the mount points, Databricks has provided utilities to interact with the file system, called dbutils. You can perform file system operations such as listing files/folders, copying files, creating directories, and so on. You can find an entire list of operations you can perform by running the following command:

dbutils.fs.help()

The preceding command will list all the operations that can be performed on a file system in Databricks.
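
To make the file system operations mentioned above more concrete, the following is a minimal sketch that combines three of them (mkdirs, cp, and ls). It assumes a Databricks notebook where dbutils is available; the helper name backup_file and the /mnt/Gen2/backup folder are hypothetical and are used for illustration only:

```python
def backup_file(dbutils, src, dest_dir):
    """Copy one file into dest_dir (created if needed) and return the file names now in it."""
    # Create the destination directory, including any missing parents
    dbutils.fs.mkdirs(dest_dir)
    # Keep the original file name at the destination
    file_name = src.rstrip("/").split("/")[-1]
    dbutils.fs.cp(src, dest_dir.rstrip("/") + "/" + file_name)
    # Each entry returned by ls exposes the name, path, and size of a file or folder
    return [f.name for f in dbutils.fs.ls(dest_dir)]
```

In a notebook, backup_file(dbutils, "/mnt/Gen2/Orders.csv", "/mnt/Gen2/backup") would copy the file used in this recipe into the hypothetical backup folder and return the names of the files in that folder.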

There's more…

You can also authenticate to ADLS Gen-2 storage accounts using a storage account access key, but this is less secure and is recommended only for non-production environments. You can get the storage account access key using the same method you learned for the Azure Blob storage account in the Getting ready section of this recipe. Run the following steps to authenticate to ADLS Gen-2 using an access key and read the Orders.csv data.

Run the following to set the storage account and access key details in variables.

#This is ADLS Gen-2 accountname and access key details
storageaccount="demostoragegen2"
acct_info=f"fs.azure.account.key.{storageaccount}.dfs.core.windows.net"
accesskey="xxx-xxx-xxx-xxx" 
print(acct_info)

To authenticate using the access key, we need to set the notebook session configs by running the following code.

#Setting account credentials in notebook session configs
spark.conf.set(
    acct_info,
   accesskey)

Run the following code to verify that we can authenticate using the access key and list the Orders.csv file information.

dbutils.fs.ls("abfss://rawdata@demostoragegen2.dfs.core.windows.net/Orders.csv")

Let's read the Orders.csv file by running through the following code.

ordersDF =spark.read.format("csv").option("header",True).load("abfss://rawdata@demostoragegen2.dfs.core.windows.net/Orders.csv")

In this section you have learnt how to authenticate and read data from ADLS Gen-2 using Storage Account access keys.

Reading and writing data from and to Azure Blob storage

In this recipe, you will learn how to read and write data from and to Azure Blob storage from Azure Databricks. You will learn how to access an Azure Blob storage account by doing the following:

Mounting storage: Covered in the Mounting ADLS Gen2 and Azure Blob storage to Azure DBFS recipe of this chapter.
Directly accessing the Blob storage account: In this scenario, we will not mount the Blob storage account, but we will directly access the storage endpoint to read and write files.

By the end of this recipe, you will know multiple ways to read/write files from and to an Azure Blob storage account.

Getting ready

You will need to ensure that the Azure Blob storage account is mounted by following the steps mentioned in the previous recipe. Get the storage key by following the steps mentioned in the Mounting ADLS Gen2 and Azure Blob storage to Azure DBFS recipe of this chapter. You can follow along by running the steps in the 2-2.Reading and Writing Data from and to Azure Blob Storage.ipynb notebook in your local cloned repository in the Chapter02 folder.

Upload the csvFiles folder in the Chapter02/Customer folder to the Azure Blob storage account in the rawdata container.

Note

We have tested the steps mentioned in this recipe on Azure Databricks Runtime version 6.4 which includes Spark 2.4.5 and on Runtime version 7.3 LTS which includes Spark 3.0.1

How to do it…

We will learn how to read the csv files under the Customer folder from the mount point and the Blob storage account directly. We will also learn how to save the DataFrame results as Parquet files in the mount point and directly to the Azure Blob storage without using the mount point:

Let's list the csv files we are trying to read by using the following code:
```
display(dbutils.fs.ls("/mnt/Blob/Customer/csvFiles/"))
```
Now, we will read the csv files in a DataFrame directly from the mount point without specifying any schema options:
```
df_cust= spark.read.format("csv").option("header",True).load("/mnt/Blob/Customer/csvFiles/")
```
When you run df_cust.printSchema(), you will find that the datatypes for all columns are strings.

Here, we are asking Spark to infer the schema from the csv files by using option("inferSchema", True):

df_cust= spark.read.format("csv").option("header",True).option("inferSchema", True).load("/mnt/Blob/Customer/csvFiles/")

Run the df_cust.printSchema() code, and you will find the datatype has changed for a few columns, such as C_CUSTKEY, where the datatype is now being shown as an integer instead of a string.

We will create a schema and explicitly assign the schema while reading the CSV files:

#Types used to build the schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ShortType, DoubleType
cust_schema = StructType([
    StructField("C_CUSTKEY", IntegerType()),
    StructField("C_NAME", StringType()),
    StructField("C_ADDRESS", StringType()),
    StructField("C_NATIONKEY", ShortType()),
    StructField("C_PHONE", StringType()),
    StructField("C_ACCTBAL", DoubleType()),
    StructField("C_MKTSEGMENT", StringType()),
    StructField("C_COMMENT", StringType())
])

We will now create a DataFrame with the schema created in the preceding step. In this step, we are further controlling datatypes, such as for C_NATIONKEY, where we are using ShortType as the datatype instead of IntegerType:
```
df_cust= spark.read.format("csv").option("header",True).schema(cust_schema).load("/mnt/Blob/Customer/csvFiles/")
```
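
The same schema can also be expressed as a DDL-formatted string, which the schema() method of the DataFrame reader accepts in place of a StructType. The following is a minimal sketch that mirrors the cust_schema defined earlier in this recipe; the variable names cust_columns and cust_schema_ddl are hypothetical:

```python
# Column name and Spark SQL type for each field, in file order
cust_columns = [
    ("C_CUSTKEY", "INT"),
    ("C_NAME", "STRING"),
    ("C_ADDRESS", "STRING"),
    ("C_NATIONKEY", "SMALLINT"),  # SMALLINT corresponds to ShortType
    ("C_PHONE", "STRING"),
    ("C_ACCTBAL", "DOUBLE"),
    ("C_MKTSEGMENT", "STRING"),
    ("C_COMMENT", "STRING"),
]

# Join the pairs into a single DDL string such as "C_CUSTKEY INT, C_NAME STRING, ..."
cust_schema_ddl = ", ".join(f"{name} {dtype}" for name, dtype in cust_columns)
print(cust_schema_ddl)

# In a Databricks notebook, where `spark` is predefined:
# df_cust = (spark.read.format("csv").option("header", True)
#            .schema(cust_schema_ddl).load("/mnt/Blob/Customer/csvFiles/"))
```

The DDL form is shorter and needs no imports, while the StructType form makes options such as nullability explicit; which one to use is a matter of preference.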

In this step, we will write the DataFrame that we have created in the preceding step to a mount point as Parquet files. We will repartition the DataFrame into 10 partitions so that 10 Parquet files are created in the mount point:

mountpoint = "/mnt/Blob"
parquetCustomerDestMount = "{}/Customer/parquetFiles".format(mountpoint)
df_cust_partitioned=df_cust.repartition(10)
df_cust_partitioned.write.mode("overwrite").option("header", "true").parquet(parquetCustomerDestMount)

We are creating a storageEndpoint variable that stores the full URL for the storage account; this is used to write the data directly to Azure Blob storage without using the mount point and is declared in the second cell of the notebook:
```
storageEndpoint ="wasbs://rawdata@{}.blob.core.windows.net".format(storageAccount)
```
Set up a storage access key so that we can directly read and write data from and to Azure Blob storage:
```
spark.conf.set(storageConnSting,storageKey)
```

After the preceding step is executed, you can directly read the CSV files from Azure Blob storage without mounting to a mount point:

df_cust= spark.read.format("csv").option("header",True).schema(cust_schema).load("wasbs://rawdata@cookbookblobstorage.blob.core.windows.net/Customer/csvFiles/")

You can view a few records of the DataFrame by executing the following code:
```
display(df_cust.limit(10))
```

Let's save the csv data in Parquet format in Azure Blob storage directly without using the mount point. We can do this by executing the following code. We are repartitioning the DataFrame to ensure we are creating 10 Parquet files:

parquetCustomerDestDirect = "wasbs://rawdata@cookbookblobstorage.blob.core.windows.net/Customer/csvFiles/parquetFilesDirect"
df_cust_partitioned_direct=df_cust.repartition(10)
df_cust_partitioned_direct.write.mode("overwrite").option("header", "true").parquet(parquetCustomerDestDirect)

You can view the Parquet files created in the preceding step by executing the following code:
```
display(dbutils.fs.ls(parquetCustomerDestDirect))
```
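
Because the DataFrame was repartitioned into 10 partitions, the listing should normally contain 10 data files alongside marker files such as _SUCCESS. The following is a minimal sketch for checking this, assuming Spark's default part-*.parquet file naming; the helper name count_parquet_parts is hypothetical:

```python
def count_parquet_parts(file_infos):
    """Count Parquet data files in a directory listing, ignoring marker files."""
    return sum(
        1
        for f in file_infos
        if f.name.startswith("part-") and f.name.endswith(".parquet")
    )

# In a Databricks notebook, after the write has finished:
# count_parquet_parts(dbutils.fs.ls(parquetCustomerDestDirect))
```

The same check works for the files written through the mount point by passing dbutils.fs.ls(parquetCustomerDestMount) instead.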

How it works…

We have seen both ways to read and write data from and to Azure Blob storage, but in most scenarios, the preferred method is to mount the storage. This way, users don't have to worry about the source or destination storage account names and URLs.

The following code is used to directly access the Blob storage without mounting it to DBFS. Without running the following code, you will not be able to access the storage account and will encounter access errors while attempting to access the data:

spark.conf.set(storageConnSting,storageKey)

You can only mount block blobs to DBFS. Azure Blob storage also supports two other blob types, page blobs and append blobs, which cannot be mounted. Once the Blob storage is mounted, all users have read and write access to the blobs under the mount point.

There's more…

Instead of a storage key, we can also access the Blob storage directly using the SAS of a container. You first need to create a SAS from the Azure portal:

Go to the Azure portal home page, and then, in the CookbookRG resource group, open the cookbookblobstorage Azure Blob storage account. Select the Shared access signature option under Settings:
Figure 2.11 – Generating a SAS token for Azure Blob storage account
Once you click on Generate SAS and connection string, it will list the SAS token, URL, and other details, as shown in the following screenshot:
Figure 2.12 – Getting SAS token keys
We will be using a SAS token, as seen in the preceding screenshot, to authenticate and use it in our spark.conf.set code.

You can execute the following code to set up a SAS for a container. You will find this code in the second-to-last cell of the notebook:

storageConnSting = "fs.azure.sas.rawdata.{}.blob.core.windows.net".format(storageAccount)
spark.conf.set(
 storageConnSting,
  "?sv=2019-12-12&ss=bfqt&srt=sco&sp=rwdlacupx&se=2021-02-01T02:33:49Z&st=2021-01-31T18:33:49Z&spr=https&sig=zzzzzzzzzzzzzz")

rawdata in the preceding code snippet is the name of the container we created in the Azure Blob storage account. After executing the preceding code, we are authenticated to the Azure Blob storage container using a SAS token.
You can read the files and folders after authenticating by running display(dbutils.fs.ls(storageEndpointFolders)).
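
A SAS token stops working once the time in its se (signed expiry) field has passed, which is a common cause of access errors. The following is a minimal sketch for checking the expiry before using a token, assuming the se field uses the full timestamp format shown in the preceding snippet; the helper names sas_expiry and sas_is_expired are hypothetical:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def sas_expiry(sas_token):
    """Return the expiry time (the `se` field) of a SAS token as a UTC datetime."""
    fields = parse_qs(sas_token.lstrip("?"))
    expiry = datetime.strptime(fields["se"][0], "%Y-%m-%dT%H:%M:%SZ")
    return expiry.replace(tzinfo=timezone.utc)


def sas_is_expired(sas_token, now=None):
    """Return True if the token's expiry time is not later than `now` (default: current time)."""
    now = now or datetime.now(timezone.utc)
    return now >= sas_expiry(sas_token)


# The placeholder token from this recipe
token = "?sv=2019-12-12&ss=bfqt&srt=sco&sp=rwdlacupx&se=2021-02-01T02:33:49Z&st=2021-01-31T18:33:49Z&spr=https&sig=zzzzzzzzzzzzzz"
print(sas_expiry(token))
print(sas_is_expired(token))
```

If the token has expired, generate a new one from the Shared access signature page and set it again with spark.conf.set.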

Reading and writing data from and to ADLS Gen2

In this recipe, you will learn how to read and write data to ADLS Gen2 from Databricks. We can do this by following these two methods:

Mounting storage: Covered in the Mounting ADLS Gen2 and Azure Blob storage to Azure DBFS recipe of this chapter.
Directly accessing the ADLS Gen2 storage using a service principal: In this scenario, we will not mount the storage, but we will directly access the storage endpoint to read and write files using a service principal and OAuth 2.0.

ADLS Gen2 provides file system semantics, which allow security to be applied at the folder and file level, and its hierarchical directory structure provides efficient access to the data in the storage.

By the end of this recipe, you will know multiple ways to read/write files from and to an ADLS Gen2 account.

Getting ready

You will need to ensure you have the following items before starting to work on this recipe:

An ADLS Gen2 account, mounted by following the steps in the first recipe of this chapter, Mounting ADLS Gen2 and Azure Blob storage to Azure DBFS.
Storage keys, which you can get by following the steps mentioned in the first recipe of this chapter, Mounting ADLS Gen2 and Azure Blob storage to Azure DBFS.

You can follow along by running the steps in the 2-3.Reading and Writing Data from and to ADLS Gen-2.ipynb notebook in your local cloned repository in the Chapter02 folder.

Upload the csvFiles folder in the Chapter02/Customer folder to the ADLS Gen2 account in the rawdata file system.

Note

We have tested the steps mentioned in this recipe on Azure Databricks Runtime version 6.4 which includes Spark 2.4.5 and on Runtime version 7.3 LTS which includes Spark 3.0.1

How to do it…

We will learn how to read CSV files from the mount point and from the ADLS Gen2 storage directly. We will perform basic aggregation on the DataFrame, such as sum, average, and maximum, and store the result in another CSV file.

Working with the mount point, we'll proceed as follows:

Let's list the CSV files we are trying to read from the mount point:
```
display(dbutils.fs.ls("/mnt/Gen2/Customer/csvFiles/"))
```
We will read the csv files directly from the mount point without specifying any schema options:
```
df_cust= spark.read.format("csv").option("header",True).load("/mnt/Gen2/Customer/csvFiles/")
```
When you run df_cust.printSchema(), you will find that the datatypes for all columns are strings.
Next, we will run the same code as in the preceding step, but this time asking Spark to infer the schema from the csv files by using option("inferSchema", True):
```
df_cust= spark.read.format("csv").option("header",True).option("inferSchema", True).load("/mnt/Gen2/Customer/csvFiles/")
```
Run df_cust.printSchema(), and you will find the datatype has changed for a few columns such as C_CUSTKEY, where the datatype is now shown as an integer instead of a string.

We will now create a schema and explicitly provide it while reading the csv files using a DataFrame:

#Types used to build the schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ShortType, DoubleType
cust_schema = StructType([
    StructField("C_CUSTKEY", IntegerType()),
    StructField("C_NAME", StringType()),
    StructField("C_ADDRESS", StringType()),
    StructField("C_NATIONKEY", ShortType()),
    StructField("C_PHONE", StringType()),
    StructField("C_ACCTBAL", DoubleType()),
    StructField("C_MKTSEGMENT", StringType()),
    StructField("C_COMMENT", StringType())
])

Create a DataFrame by using the schema created in the preceding step:

df_cust= spark.read.format("csv").option("header",True).schema(cust_schema).load("/mnt/Gen2/Customer/csvFiles/")

In the following step, we will be performing basic aggregation on the DataFrame:

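#These Spark functions shadow Python's built-in sum and max in this notebook
from pyspark.sql.functions import sum, avg, max
df_cust_agg = (df_cust.groupBy("C_MKTSEGMENT")
    .agg(sum("C_ACCTBAL").cast('decimal(20,3)').alias("sum_acctbal"),
         avg("C_ACCTBAL").alias("avg_acctbal"),
         max("C_ACCTBAL").alias("max_bonus"))
    .orderBy("avg_acctbal", ascending=False))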

We will write the DataFrame we created in the preceding step to the mount point and save it in CSV format:
```
df_cust_agg.write.mode("overwrite").option("header", "true").csv("/mnt/Gen2/CustMarketSegmentAgg/")
```
To list the CSV file created, run the following code:
```
display(dbutils.fs.ls("/mnt/Gen2/CustMarketSegmentAgg/"))
```

We'll now work with an ADLS Gen2 storage account without mounting it to DBFS:

You can access an ADLS Gen2 storage account directly without mounting to DBFS using OAuth 2.0 and a service principal. You can access any ADLS Gen2 storage account that the service principal has permissions on. We need to set the credentials first in our notebook before we can directly access the file system. clientID, clientSecret, and oauth2Endpoint are variables defined earlier in the notebook:

spark.conf.set("fs.azure.account.auth.type.cookbookadlsgen2storage.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.cookbookadlsgen2storage.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.cookbookadlsgen2storage.dfs.core.windows.net", clientID)
spark.conf.set("fs.azure.account.oauth2.client.secret.cookbookadlsgen2storage.dfs.core.windows.net", clientSecret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.cookbookadlsgen2storage.dfs.core.windows.net", oauth2Endpoint)

After the preceding step is executed, you can directly read the csv files from the ADLS Gen2 storage account without mounting it:

df_direct = spark.read.format("csv").option("header",True).schema(cust_schema).load("abfss://rawdata@cookbookadlsgen2storage.dfs.core.windows.net/Customer/csvFiles")

You can view a few records of the DataFrame by executing the following code:
```
display(df_direct.limit(10))
```

We will now write a DataFrame in Parquet format in the ADLS Gen2 storage account directly, without using the mount point, by executing the following code. We are repartitioning the DataFrame to ensure we are creating 10 Parquet files:

parquetCustomerDestDirect = "abfss://rawdata@cookbookadlsgen2storage.dfs.core.windows.net/Customer/parquetFiles"
df_direct_repart=df_direct.repartition(10)
df_direct_repart.write.mode("overwrite").option("header", "true").parquet(parquetCustomerDestDirect)

You can create a DataFrame on the Parquet files created in the preceding step to ensure we are able to read the data:

df_parquet = spark.read.format("parquet").option("header",True).schema(cust_schema).load("abfss://rawdata@cookbookadlsgen2storage.dfs.core.windows.net/Customer/parquetFiles")

You can view the Parquet files created in the preceding step by executing the following code:
```
display(dbutils.fs.ls(parquetCustomerDestDirect))
```

How it works…

The following settings are what allow us to directly access the ADLS Gen2 storage account without mounting it to DBFS. They apply when we are using the DataFrame or Dataset APIs:

spark.conf.set("fs.azure.account.auth.type.cookbookadlsgen2storage.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.cookbookadlsgen2storage.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.cookbookadlsgen2storage.dfs.core.windows.net", clientID)
spark.conf.set("fs.azure.account.oauth2.client.secret.cookbookadlsgen2storage.dfs.core.windows.net", clientSecret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.cookbookadlsgen2storage.dfs.core.windows.net", oauth2Endpoint)

You should set the preceding values in your notebook session if you want the users to directly access the ADLS Gen2 storage account without mounting to DBFS. This method is useful when you are doing some ad hoc analysis and don't want users to create multiple mount points when you are trying to access data from various ADLS Gen2 storage accounts.
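
When you work with several ADLS Gen2 storage accounts, typing the account name into each of the five settings is error-prone. The following is a minimal sketch that builds the same settings for any account name; the helper name adls_oauth_conf is hypothetical, and the setting names and token endpoint are the ones used earlier in this recipe:

```python
def adls_oauth_conf(account, client_id, client_secret, tenant_id):
    """Build the per-account OAuth settings for direct access to ADLS Gen2."""
    host = f"{account}.dfs.core.windows.net"
    endpoint = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}": endpoint,
    }

# In a Databricks notebook, where `spark` is predefined:
# conf = adls_oauth_conf("cookbookadlsgen2storage", clientID, clientSecret, tenantID)
# for key, value in conf.items():
#     spark.conf.set(key, value)
```

Because the account name is supplied once, every setting is guaranteed to carry the same suffix, and the helper can be called again for each additional storage account.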
