Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Azure Databricks Cookbook

You're reading from   Azure Databricks Cookbook Accelerate and scale real-time analytics solutions using the Apache Spark-based analytics service

Arrow left icon
Product type Paperback
Published in Sep 2021
Publisher Packt
ISBN-13 9781789809718
Length 452 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Authors (2):
Arrow left icon
Vinod Jaiswal Vinod Jaiswal
Author Profile Icon Vinod Jaiswal
Vinod Jaiswal
Phani Raj Phani Raj
Author Profile Icon Phani Raj
Phani Raj
Arrow right icon
View More author details
Toc

Table of Contents (12) Chapters Close

Preface 1. Chapter 1: Creating an Azure Databricks Service 2. Chapter 2: Reading and Writing Data from and to Various Azure Services and File Formats FREE CHAPTER 3. Chapter 3: Understanding Spark Query Execution 4. Chapter 4: Working with Streaming Data 5. Chapter 5: Integrating with Azure Key Vault, App Configuration, and Log Analytics 6. Chapter 6: Exploring Delta Lake in Azure Databricks 7. Chapter 7: Implementing Near-Real-Time Analytics and Building a Modern Data Warehouse 8. Chapter 8: Databricks SQL 9. Chapter 9: DevOps Integrations and Implementing CI/CD for Azure Databricks 10. Chapter 10: Understanding Security and Monitoring in Azure Databricks 11. Other Books You May Enjoy

Reading and writing data from and to Azure Blob storage

In this recipe, you will learn how to read and write data from and to Azure Blob storage from Azure Databricks. You will learn how to access an Azure Blob storage account by doing the following:

  • Mounting storage: Covered in the Mounting ADLS Gen2 and Azure Blob storage to Azure DBFS recipe of this chapter.
  • Directly accessing the Blob storage account: In this scenario, we will not mount the Blob storage account, but we will directly access the storage endpoint to read and write files.

By the end of this recipe, you will know multiple ways to read/write files from and to an Azure Blob storage account.

Getting ready

You will need to ensure that the Azure Blob storage account is mounted by following the steps mentioned in the previous recipe. Get the storage key by following the steps mentioned in the Mounting ADLS Gen2 and Azure Blob storage to Azure DBFS recipe of this chapter. You can follow along by running the steps in the 2-2.Reading and Writing Data from and to Azure Blob Storage.ipynb notebook in your local cloned repository in the Chapter02 folder.

Upload the csvFiles folder in the Chapter02/Customer folder to the Azure Blob storage account in the rawdata container.

Note

We have tested the steps mentioned in this recipe on Azure Databricks Runtime version 6.4 which includes Spark 2.4.5 and on Runtime version 7.3 LTS which includes Spark 3.0.1

How to do it…

We will learn how to read the csv files under the Customer folder from the mount point and the Blob storage account directly. We will also learn how to save the DataFrame results as Parquet files in the mount point and directly to the Azure Blob storage without using the mount point:

  1. Let's list the csv files we are trying to read by using the following code:
    display(dbutils.fs.ls("/mnt/Blob/Customer/csvFiles/"))
  2. Now, we will read the csv files in a DataFrame directly from the mount point without specifying any schema options:
    df_cust= spark.read.format("csv").option("header",True).load("/mnt/Blob/Customer/csvFiles/")
  3. When you run df_cust.printSchema(), you will find that the datatypes for all columns are strings.
  4. Here, we are asking Spark to infer the schema from the csv files by using option("header","true"):
    df_cust= spark.read.format("csv").option("header",True).option("inferSchema", True).load("/mnt/Blob/Customer/csvFiles/")
  5. Run the df_cust.printSchema() code, and you will find the datatype has changed for a few columns, such as CustKey, where the datatype is now being shown as an integer instead of a String.
  6. We will create a schema and explicitly assign the schema while reading the CSV files:
    cust_schema = StructType([
        StructField("C_CUSTKEY", IntegerType()),
        StructField("C_NAME", StringType()),
        StructField("C_ADDRESS", StringType()),
        StructField("C_NATIONKEY", ShortType()),
        StructField("C_PHONE", StringType()),
        StructField("C_ACCTBAL", DoubleType()),
        StructField("C_MKTSEGMENT", StringType()),
        StructField("C_COMMENT", StringType())
    ])
  7. We will now create a DataFrame with the schema created in the preceding step. In this step, we are further controlling datatypes, such as for NationKey, where we are using ShortType as the datatype instead of IntegerType:
    df_cust= spark.read.format("csv").option("header",True).schema(cust_schema).load("/mnt/Blob/Customer/csvFiles/")
  8. In this step, we will write the DataFrame that we have created in the preceding step to a mount point as a Parquet file. We will repartition the DataFrame to 10 so that we are sure 10 Parquet files are created in the mount point:
    Mountpoint= "/mnt/Blob"
    parquetCustomerDestMount = "{}/Customer/parquetFiles".format(mountpoint)"{}/Customer/parquetFiles".format(mountpoint)
    df_cust_partitioned=df_cust.repartition(10)
    df_cust_partitioned.write.mode("overwrite").option("header", "true").parquet(parquetCustomerDestMount)
  9. We are creating a storageEndpoint variable that stores the full URL for the storage account; this is used to write the data directly to Azure Blob storage without using the mount point and is declared in the second cell of the notebook:
    storageEndpoint ="wasbs://rawdata@{}.blob.core.windows.net".format(storageAccount)
  10. Set up a storage access key so that we can directly read and write data from and to Azure Blob storage:
    spark.conf.set(storageConnSting,storageKey)
  11. After the preceding step is executed, you can directly read the CSV files from Azure Blob storage without mounting to a mount point:
    df_cust= spark.read.format("csv").option("header",True).schema(cust_schema).load("wasbs://rawdata@cookbookblobstorage.blob.core.windows.net/Customer/csvFiles/")
  12. You can view a few records of the DataFrame by executing the following code:
    display(df_cust.limit(10))
  13. Let's save the csv data in Parquet format in Azure Blob storage directly without using the mount point. We can do this by executing the following code. We are repartitioning the DataFrame to ensure we are creating 10 Parquet files:
    parquetCustomerDestDirect = "wasbs://rawdata@cookbookblobstorage.blob.core.windows.net/Customer/csvFiles/parquetFilesDirect"
    df_cust_partitioned_direct=df_cust.repartition(10)
    df_cust_partitioned_direct.write.mode("overwrite").option("header", "true").parquet(parquetCustomerDestDirect)
  14. You can view the Parquet files created in the preceding step by executing the following code:
    display(dbutils.fs.ls(parquetCustomerDestDirect))

How it works…

We have seen both ways to read and write data from and to Azure Blob storage, but in most scenarios, the preferred method is to mount the storage. This way, users don't have to worry about the source or destination storage account names and URLs.

The following code is used to directly access the Blob storage without mounting it to DBFS. Without running the following code, you will not be able to access the storage account and will encounter access errors while attempting to access the data:

spark.conf.set(storageConnSting,storageKey)

You can only mount block blobs to DBFS, and there are other blob types that Azure Blob storage supports—these are page and append. Once the blob storage is mounted, all users would have read and write access to the blob that is mounted to DBFS.

There's more…

Instead of a storage key, we can also access the Blob storage directly using the SAS of a container. You first need to create a SAS from the Azure portal:

  1. Go to the Azure portal home page, and then, in the CookBookRG resource group, open the cookbookblobstorage Azure Blob storage account. Select the Shared access signature option under Settings:
    Figure 2.11 – Generating a SAS token for Azure Blob storage account

    Figure 2.11 – Generating a SAS token for Azure Blob storage account

  2. Once you click on Generate SAS and connection string, it will list the SAS token, URL, and other details, as shown in the following screenshot:
    Figure 2.12 – Getting SAS token keys

    Figure 2.12 – Getting SAS token keys

  3. We will be using a SAS token, as seen in the preceding screenshot, to authenticate and use it in our spark.conf.set code.
  4. You can execute the following code to set up a SAS for a container. You will find this code in the second-to-last cell of the notebook:
    storageConnSting = "fs.azure.sas.rawdata.{}.blob.core.windows.net".format(storageAccount)
    spark.conf.set(
     storageConnSting,
      "?sv=2019-12-12&ss=bfqt&srt=sco&sp=rwdlacupx&se=2021-02-01T02:33:49Z&st=2021-01-31T18:33:49Z&spr=https&sig=zzzzzzzzzzzzzz")
  5. rawdata in the preceding code snippet is the name of the container we created in the Azure Blob storage account. After executing the preceding code, we are authenticating to the Azure blob using a SAS token.
  6. You can read the files and folders after authenticating by running display(dbutils.fs.ls(storageEndpointFolders)).
You have been reading a chapter from
Azure Databricks Cookbook
Published in: Sep 2021
Publisher: Packt
ISBN-13: 9781789809718
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image