Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Optimizing Databricks Workloads
Optimizing Databricks Workloads

Optimizing Databricks Workloads: Harness the power of Apache Spark in Azure and maximize the performance of modern big data workloads

Arrow left icon
Profile Icon Anirudh Kala Profile Icon Sarbahi Profile Icon Bhatnagar
Arrow right icon
€18.99 per month
Full star icon Full star icon Full star icon Full star icon Half star icon 4.1 (13 Ratings)
Paperback Dec 2021 230 pages 1st Edition
eBook
€8.99 €26.99
Paperback
€32.99
Subscription
Free Trial
Renews at €18.99p/m
Arrow left icon
Profile Icon Anirudh Kala Profile Icon Sarbahi Profile Icon Bhatnagar
Arrow right icon
€18.99 per month
Full star icon Full star icon Full star icon Full star icon Half star icon 4.1 (13 Ratings)
Paperback Dec 2021 230 pages 1st Edition
eBook
€8.99 €26.99
Paperback
€32.99
Subscription
Free Trial
Renews at €18.99p/m
eBook
€8.99 €26.99
Paperback
€32.99
Subscription
Free Trial
Renews at €18.99p/m

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing
Table of content icon View table of contents Preview book icon Preview Book

Optimizing Databricks Workloads

Chapter 1: Discovering Databricks

The original creators of Apache Spark established Databricks to solve the world's toughest data problems. Databricks was launched as a Spark-based unified data analytics platform in the cloud.

In this chapter, we will begin by understanding the internal architecture of Apache Spark™. This will be followed by an introduction to the basic components of Databricks. The following topics will be covered in this chapter:

  • Introducing Spark fundamentals
  • Introducing Databricks
  • Learning about Delta Lake

Technical requirements

For this chapter, you will need the following:

  • An Azure subscription
  • Azure Databricks

Please refer to the code sample from: Code samples from https://github.com/PacktPublishing/Optimizing-Databricks-Workload/tree/main/Chapter01

Introducing Spark fundamentals

Spark is a distributed data processing framework capable of analyzing large datasets. At its very core, it consists of the following:

  • DataFrames: Fundamental data structures consisting of rows and columns.
  • Machine Learning (ML): Spark ML provides ML algorithms for processing big data.
  • Graph processing: GraphX helps to analyze relationships between objects.
  • Streaming: Spark's Structured Streaming helps to process real-time data.
  • Spark SQL: A SQL to Spark engine with query plans and a cost-based optimizer.

DataFrames in Spark are built on top of Resilient Distributed Datasets (RDDs), which are now treated as the assembly language of the Spark ecosystem. Spark is compatible with various programming languages – Scala, Python, R, Java, and SQL.

Spark encompasses an architecture with one driver node and multiple worker nodes. The driver and worker nodes together constitute a Spark cluster. Under the hood, these nodes are based in Java Virtual Machines (JVMs). The driver is responsible for assigning and coordinating work between the workers.

Figure 1.1 – Spark architecture – driver and workers

The worker nodes have executors running inside each of them, which host the Spark program. Each executor consists of one or more slots that act as the compute resource. Each slot can process a single unit of work at a time.

Figure 1.2 – Spark architecture – executors and slots

Every executor reserves memory for two purposes:

  • Cache
  • Computation

The cache section of the memory is used to store the DataFrames in a compressed format (called caching), while the compute section is utilized for data processing (aggregations, joins, and so on). For resource allocation, Spark can be used with a cluster manager that is responsible for provisioning the nodes of the cluster. Databricks has an in-built cluster manager as part of its overall offering.

Note

Executor slots are also called cores or threads.

Spark supports parallelism in two ways:

  • Vertical parallelism: Scaling the number of slots in the executors
  • Horizontal parallelism: Scaling the number of executors in a Spark cluster

Spark processes the data by breaking it down into chunks called partitions. These partitions are usually 128 MB blocks that are read by the executors and assigned to them by the driver. The size and the number of partitions are decided by the driver node. While writing Spark code, we come across two functionalities, transformations and actions. Transformations instruct the Spark cluster to perform changes to the DataFrame. These are further categorized into narrow transformations and wide transformations. Wide transformations lead to the shuffling of data as data requires movement across executors, whereas narrow transformations do not lead to re-partitioning across executors.

Running these transformations does not make the Spark cluster do anything. It is only when an action is called that the Spark cluster begins execution, hence the saying Spark is lazy. Before executing an action, all that Spark does is make a data processing plan. We call this plan the Directed Acyclic Graph (DAG). The DAG consists of various transformations such as read, filter, and join and is triggered by an action.

Figure 1.3 – Transformations and actions

Every time a DAG is triggered by an action, a Spark job gets created. A Spark job is further broken down into stages. The number of stages depends on the number of times a shuffle occurs. All narrow transformations occur in one stage while wide transformations lead to the formation of new stages. Each stage comprises of one or more tasks and each task processes one partition of data in the slots. For wide transformations, the stage execution time is determined by its slowest running task. This is not the case with narrow transformations.

At any moment, one or more tasks run parallelly across the cluster. Every time a Spark cluster is set up, it leads to the creation of a Spark session. This Spark session provides entry into the Spark program and is accessible with the spark keyword.

Sometimes, a few tasks process small partitions while others process larger chunks, we call this data skewing. This skewing of data should always be avoided if you hope to run efficient Spark jobs. In a broad execution, the stage is determined by its slowest task, so if a task is slow, the overall stage is slow and everything waits for that to finish. Also, whenever a wide transformation is run, the number of partitions of the data in the cluster changes to 200. This is a default setting, but can be modified using Spark configuration.

Figure 1.4 – Jobs, stages, and tasks

As a rule of thumb, the total number of partitions should always be in the multiples of the total slots in the cluster. For instance, if a cluster has 16 slots and the data has 16 partitions, then each slot receives 1 task that processes 1 partition. But instead, if there are 15 partitions, then 1 slot will remain empty. This leads to the state of cluster underutilization. In the case of 17 partitions, a job will take twice the time to complete as it will wait for that 1 extra task to finish processing.

Let's move on from Spark for now and get acquainted with Databricks.

Introducing Databricks

Databricks provides a collaborative platform for data engineering and data science. Powered by the potential of Apache Spark™, Databricks helps enable ML at scale. It has also revolutionized the existing data lakes by introducing the Lakehouse architecture. You can refer to the following published whitepaper to learn about the Lakehouse architecture in detail: http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf.

Irrespective of the data role in any industry, Databricks has something for everybody:

  • Data engineer: Create ETL and ELT pipelines to run big data workloads.
  • Data scientist: Perform exploratory data analysis and train ML models at scale.
  • Data analyst: Perform big data analytics harnessing the power of Apache Spark.
  • Business intelligence analyst: Build powerful dashboards using Databricks SQL Analytics.

Databricks and Spark together provide a unified platform for big data processing in the cloud. This is possible because Spark is a compute engine that remains decoupled from storage. Spark in Databricks combines ETL, ML, and real-time streaming with collaborative notebooks. Processing in Databricks can scale to petabytes of data and thousands of nodes in no time!

Spark can connect to any number of data sources, including Amazon S3, Azure Data Lake, HDFS, Kafka, and many more. As Databricks lives in the cloud, spinning up a Spark cluster is possible with the click of a button. We do not need to worry about setting up infrastructure to use Databricks. This enables us to focus on the data at hand and continue solving problems.

Currently, Databricks is available on all four major cloud platforms, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, and Alibaba Cloud. In this book, we will be working on Azure Databricks with the standard pricing tier. Databricks is a first-party service in Azure and is deeply integrated with the complete Azure ecosystem.

Since Azure Databricks is a cloud-native managed service, there is a cost associated with its usage. To view the Databricks pricing, check out https://azure.microsoft.com/en-in/pricing/details/databricks/.

Creating an Azure Databricks workspace

To create a Databricks instance in Azure, we will need an Azure subscription and a resource group. An Azure subscription is a gateway to Microsoft's cloud services. It entitles us to create and use Azure's services. A resource group in Azure is equivalent to a logical container that hosts the different services. To create an Azure Databricks instance, we need to complete the following steps:

  1. Go to Azure's website, portal.azure.com, and log in with your credentials.
  2. In the Navigate section, click on Resource groups and then Create.
  3. In Subscription, set the name of the Azure subscription, set a suitable name in Resource group, set Region, and click on Review + create.
  4. A Validation passed message will flash at the top; then, click on Create. Once the resource group is created, a notification will pop up saying Resource group created. Following the message, click on Go to resource group.
  5. This opens an empty resource group. Now it is time to create a Databricks instance. Click on Create and select Marketplace. In the search bar, type Azure Databricks and select it from the drop-down menu. Click on Create.

    Figure 1.5 – Creating Azure Databricks

  6. Set Subscription, Resource group, a name, Region, and Pricing Tier (Standard, Premium, or Trial). In this book, we will be working with the standard pricing tier of Azure Databricks.

    Figure 1.6 – Creating an Azure Databricks workspace

  7. Click on Next : Networking. We will not be creating Databricks in a VNet. So, keep both the options toggled to No. Finally, click on Review + create. Once the validation is successful, click on Create. It takes a few minutes for the Databricks instance to get created. Once the deployment is complete, click on Go to Resource. To launch the Databricks workspace, click on Launch Workspace.
Figure 1.7 – Azure Databricks workspace

Figure 1.7 – Azure Databricks workspace

Now that we have a workspace up and running, let's explore how we can apply it to different concepts.

Core Databricks concepts

The Databricks workspace menu is displayed on the left pane. We can configure the menu based on our workloads, Data Science and Engineering or Machine Learning. Let's start with the former. We will learn more about the ML functionalities in Chapter 3, Learning about Machine Learning and Graph Processing with Databricks. The menu consists of the following:

  • Workspace: A repository with a folder-like structure that contains all the Azure Databricks assets. These assets include the following:
    • Notebooks: An interface that holds code, visualizations, and markdown text. Notebooks can also be imported and exported from the Databricks workspace.
    • Library: A package of commands made available to a notebook or a Databricks job.
    • Dashboard: A structured representation of the selective visualizations used in a notebook.
    • Folder: A logical grouping of related assets in the workspace.
  • Repos: This provides integration with Git providers such as GitHub, Bitbucket, GitLab, and Azure DevOps.
  • Recents: Displays the most recently used notebooks in the Databricks workspace.
  • Search: The search bar helps us to find assets in the workspace.
  • Data: This is the data management tab that is built on top of a Hive metastore. Here, we can find all the Hive tables registered in the workspace. The Hive metastore stores all the metadata information, such as column details, storage path, and partitions, but not the actual data that resides in a cloud storage location. The tables are queried using Apache Spark APIs including Python, Scala, R, and SQL. Like any data warehouse, we need to create a database and then register the Hive tables.
  • Compute: This is where we interact with the Spark cluster. It is further divided into three categories:
    • All-Purpose Clusters: These are used to run Databricks notebooks and jobs. An all-purpose cluster can also be shared among users in the workspace. You can manually terminate or restart an all-purpose cluster.
    • Job Clusters: These are created by the Azure Databricks job scheduler when a new job is created. Once the job is complete, it is automatically terminated. It is not possible to restart a job cluster.
    • Pools: A pool is a set of idle node instances that help reduce cluster start up and autoscaling times. When a cluster is attached to a pool, it acquires the driver and worker instances from within the pool.
Figure 1.8 – Clusters in Azure Databricks

Figure 1.8 – Clusters in Azure Databricks

  • Jobs: A job is a mechanism that helps to schedule notebooks for creating data pipelines.

Now that we have an understanding of the core concepts of Databricks, let's create our first Spark cluster!

Creating a Spark cluster

It is time to create our first cluster! In the following steps, we will create an all-purpose cluster and later attach it to a notebook. We will be discussing cluster configurations in detail in Chapter 4, Managing Spark Clusters:

  1. Select the Clusters tab and click on Create Cluster. Give an appropriate name to the cluster.
  2. Set Cluster Mode to Standard. Standard cluster mode is ideal when there is a single user for the cluster. High Concurrency mode is recommended for concurrent users, but it does not support Scala. A Single Node cluster has no worker nodes and is only suitable for small data volumes.
  3. Leave the Pool option as None and Databricks Runtime Version as the default value. The Databricks runtime version decides the Spark version and configurations for the cluster.
  4. For Autopilot Options, disable the Enable autoscaling checkbox. Autoscaling helps the cluster to automatically scale between the maximum and minimum number of worker nodes. In the second autopilot option, replace 120 with 030 to terminate the cluster after 30 minutes of inactivity.
  5. We can leave the Worker Type and Driver Type options as their default values. Set the number of workers to 01. Keep the Spot instances checkbox disabled. When enabled, the cluster uses Azure Spot VMs to save costs.
  6. Click on Create Cluster.
Figure 1.9 – Initializing a Databricks cluster

Figure 1.9 – Initializing a Databricks cluster

With the Spark cluster initialized, let's create our first Databricks notebook!

Databricks notebooks

Now we'll create our first Databricks notebook. On the left pane menu, click on Create and select Notebook. Give a suitable name to the notebook, keep the Default Language option as Python, set Cluster, and click on Create.

Figure 1.10 – Creating a Databricks notebook

Figure 1.10 – Creating a Databricks notebook

We can create documentation cells to independently run blocks of code. A new cell can be created with the click of a button. For people who have worked with Jupyter Notebooks, this interface might look familiar.

We can also execute code in different languages right inside one notebook. For example, the first notebook that we've created has a default language of Python, but we can also run code in Scala, SQL, and R in the same notebook! This is made possible with the help of magic commands. We need to specify the magic command at the beginning of a new cell:

  • Python: %python or %py
  • R: %r
  • Scala: %scala
  • SQL: %sql

    Note

    The %pip magic command can also be used in Databricks notebooks to manage notebook-scoped libraries.

Let us look at executing code in multiple languages in the following image:

Figure 1.11 – Executing code in multiple languages

Figure 1.11 – Executing code in multiple languages

We can also render a cell as Markdown using the %md magic command. This allows us to add rendered text between cells of code.

Databricks notebooks also support rendering HTML graphics using the displayHTML function. Currently, this feature is only supported for Python, R, and Scala notebooks. To use the function, we need to pass in HTML, CSS, or JavaScript code:

Figure 1.12 – Rendering HTML in a notebook

Figure 1.12 – Rendering HTML in a notebook

We can use the %sh magic command to run shell commands on the driver node.

Databricks provides a Databricks Utilities (dbutils) module to perform tasks collectively. With dbutils, we can work with external storage, parametrize notebooks, and handle secrets. To list the available functionalities of dbutils, we can run dbutils.help() in a Python or Scala notebook.

The notebooks consist of another feature called widgets. These widgets help to add parameters to a notebook and are made available with the dbutils module. By default, widgets are visible at the top of a notebook and are categorized as follows:

  • Text: Input the string value in a textbox.
  • Dropdown: Select a value from a list.
  • Combobox: Combination of text and drop-down widgets.
  • Multiselect: Select one or more values from a list:
Figure 1.13 – Notebook widget example. Here, we create a text widget, 
fetch its value, and call it in a print statement

Figure 1.13 – Notebook widget example. Here, we create a text widget, fetch its value, and call it in a print statement

We can also run one notebook inside another using the %run magic command. The magic command must be followed by the notebook path.

Figure 1.14 – Using the %run magic command

Figure 1.14 – Using the %run magic command

Databricks File System (DBFS)

DBFS is a filesystem mounted in every Databricks workspace for temporary storage. It is an abstraction on top of a scalable object store in the cloud. For instance, in the case of Azure Databricks, the DBFS is built on top of Azure Blob Storage. But this is managed for us, so we needn't worry too much about how and where the DBFS is actually located. All we need to understand is how can we use DBFS inside a Databricks workspace.

DBFS helps us in the following ways:

  • We can persist data to DBFS so that it is not lost after the termination of the cluster.
  • DBFS allows us to mount object storage such as Azure Data Lake or Azure Blob Storage in the workspace. This makes it easy to access data without requiring credentials every time.
  • We can access and interact with the data using the directory semantics instead of using URLs for object storage.

DBFS has a default storage location called the DBFS root. We can access DBFS in several ways:

  • With the %fs magic command: We can use the %fs command in a notebook cell.
  • With dbutils: We can call the dbutils module to access the DBFS. Using dbutils.fs.ls("<path>") is equivalent to running %fs ls <path>. Here, <path> is a DBFS path. Both these commands list the directories in a specific DBFS "path."
Figure 1.15 – Listing all files in the DBFS root using the %fs magic command

Figure 1.15 – Listing all files in the DBFS root using the %fs magic command

Note

We need to enclose dbutils.fs.ls("path") in Databricks' display() function to obtain a rendered output.

Databricks jobs

A Databricks job helps to run and automate activities such as an ETL job or a data analytics task. A job can be executed either immediately or on a scheduled basis. A job can be created by using the UI or CLI or invoking the Jobs UI. We will now create a job using the Databricks UI:

  1. Create a new Databricks Python notebook with the name jobs-notebook and paste the following code in a new cell. This code creates a new delta table and inserts records into the table. We'll learn about Delta Lake in more detail later in this chapter. Note that the following two code blocks must be run in the same cell.

    The following code block creates a delta table in Databricks with the name of insurance_claims. The table has four columns, user_id, city, country, and amount:

    %sql
    -- Creating a delta table and storing data in DBFS
    -- Our table's name is 'insurance_claims' and has four columns
    CREATE OR REPLACE TABLE insurance_claims (
      user_id INT NOT NULL,
      city STRING NOT NULL,
      country STRING NOT NULL,
      amount INT NOT NULL
    )
    USING DELTA
    LOCATION 'dbfs:/tmp/insurance_claims';

    Now, we will insert five records into the table. In the following code block, every INSERT INTO statement inserts one new record into the delta table:

    INSERT INTO insurance_claims (user_id, city, country, amount)
    VALUES (100, 'Mumbai', 'India', 200000);
    INSERT INTO insurance_claims (user_id, city, country, amount)
    VALUES (101, 'Delhi', 'India', 400000);
    INSERT INTO insurance_claims (user_id, city, country, amount)
    VALUES (102, 'Chennai', 'India', 100000);
    INSERT INTO insurance_claims (user_id, city, country, amount)
    VALUES (103, 'Bengaluru', 'India', 700000);
  2. On the workspace menu, click on Jobs and then Create Job. Give a name to the job and keep Schedule Type as Manual (Paused). Under the Task heading, set Type to Notebook, select the jobs-notebook notebook that we created, and in Cluster, select an existing all-purpose cluster.
  3. Keep Maximum Concurrent Runs at the default value of 1. Under Alerts, click on Add. Add the email address to which alerts must be sent and select Success and Failure. This will ensure that the designated email address will be notified upon a job success or failure.
  4. Click on Create. Once the job is created, click on Runs and select Run Now.
  5. As soon as the job completes, we will receive an email informing us whether the job succeeded or failed. If the job is in progress, we can find more information about the current run under Active Runs.
  6. When the job finishes, a new record will be added under Completed Runs (past 60 days) giving the start time, mode of launch, duration of run, and status of run.
Figure 1.16 – Successful manual run of a Databricks job

Figure 1.16 – Successful manual run of a Databricks job

Databricks Community

Databricks Community is a platform that provides a free-of-cost Databricks workspace. It supports a single node cluster wherein we have one driver and no workers. The community platform is great for beginners to get started with Databricks. But several features of Azure Databricks are not supported in the Community edition. For example, we cannot create jobs or change cluster configuration settings. To sign up for Databricks Community, visit https://community.cloud.databricks.com/login.html.

Learning about Delta Lake

Delta Lake was launched by Databricks as an open source project owned by the Linux Foundation that converts a traditional data lake into a lakehouse. The term lakehouse refers to a platform that brings in the best of both data lakes and warehouses. Delta Lake offers the following features:

  • ACID transactions: Data readers never see inconsistent data, also called dirty reads.
  • Handling metadata: Spark's distributed processing power makes it easy to handle the metadata of tables with terabytes and petabytes of data in a scalable fashion.
  • Streaming and batch workloads: Delta Lake helps unify stream and batch processing on data.
  • Schema enforcement: The schema is always checked before data gets appended. This helps to prevent the corruption of data.
  • Time travel: View and roll back to previous versions of data enabling audit logs.
  • Upserts and deletes: Perform transactions such as updates, inserts, and deletes on data lying in the lake.

Next, we'll look at big data file formats.

Big data file formats

Before we dive deeper into Delta Lake, let's first try to understand the file formats used to store big data. Traditional file formats such as CSV and TSV store data in a row-wise format and are not partitioned. CSVs are basically strings without any data types so they always need to be scanned entirely without any scope for filtering. This makes it difficult for processing and querying larger datasets. Instead, file formats such as Parquet, ORC, and Avro help us overcome many such challenges as they can be stored in a distributed fashion.

Note

Row-based file formats store data by row, whereas columnar file formats store data by column. Row-based file formats work best for transactional writes of data, whereas columnar file formats are ideal for data querying and analytical workloads.

Let us look at row-based file formats versus columnar file formats in the following image:

Figure 1.17 – Row-based file formats versus columnar

Figure 1.17 – Row-based file formats versus columnar

The similarities between Parquet, ORC, and Avro are as follows:

  • All three file formats are machine-readable formats and not human-readable.
  • They can be partitioned across a cluster for parallel processing and distribution.
  • The formats carry the data schema as well. This helps newer machines or clusters to process the data independently.

The differences between Parquet, ORC, and Avro are as follows:

Figure 1.18 – Comparison of Avro, Parquet, and ORC

Figure 1.18 – Comparison of Avro, Parquet, and ORC

Coming back to Delta Lake, it can simply be treated as a file format. Tables that are created on top of this delta file format are simply called delta tables. The delta file format is mainly composed of two components:

  • Transactional logs: A _delta_log folder is created when data is written in the delta file format. This folder stores files that record all the transactions to data.
  • Versioned Parquet files: This is the actual data that is written out as partitions. These Parquet partition files (.parquet) can also be compacted later using different functions. For efficient querying purposes, these Parquet partition files can also be distributed based on partition folders.
Figure 1.19 – Data written in delta format as viewed in Azure Blob Storage

Figure 1.19 – Data written in delta format as viewed in Azure Blob Storage

It's also important for us to understand the use and value of the transactional log.

Understanding the transactional log

Having an understanding of the transactional log is imperative when working with Delta Lake. Let's take a peek at the contents of the _delta_log folder.

Figure 1.20 – Contents of the _delta_log folder as viewed in Azure Blob Storage

Figure 1.20 – Contents of the _delta_log folder as viewed in Azure Blob Storage

Whenever a transaction is carried out on a delta table, the changes are recorded in the _delta_log folder in the form of JSON files. The naming conventions of these JSON files begin sequentially, starting with 000000.json. Subsequent JSON files are created as changes get committed (000001.json, 000002.json, and so on). Also, with each fresh transaction, a new set of Parquet files may be written. In this process, the new JSON file created in the _delta_log folder keeps a record of which Parquet files to reference and which to omit. This happens because every transaction to a delta table results in a new version of the table.

Figure 1.21 – JSON files in _delta_log

Figure 1.21 – JSON files in _delta_log

Let's see how this works with an example. Suppose we have a delta table with a _delta_log folder containing 00000.json. Suppose this JSON file references two Parquet files, part-0000.parquet and part-0001.parquet.

Now we have an UPDATE transaction carried out on the delta table. This creates a new JSON file in the _delta_log folder by the name of 00001.json. Also, a new Parquet file is added in the delta table's directory, part-0002.parquet. Upon checking the new JSON file, we find that it references part-0001.parquet and part-0002.parquet but omits part-0000.parquet.

Delta Lake in action

Let's start by creating a Spark DataFrame by reading a CSV file. Create a new Databricks Python notebook and spin up a Spark cluster with one driver, one worker, the standard type, and autoscaling disabled. Every code block in the following section must be executed in a new notebook cell:

  1. We will be using the airlines dataset from the databricks-datasets repository. Databricks provides many sample datasets in every workspace. These are part of the databricks-datasets directory of the DBFS. The following code block creates a new Spark DataFrame by specifying the first row as the header, automatically inferring the schema, and reading from a CSV file. Once the DataFrame is created, we will display the first five rows:
    airlines = (spark.read
                .option("header",True)
                .option("inferSchema",True)
                .option("delimiter",",")
                .csv("dbfs:/databricks-datasets/airlines/part-00000")
    # View the dataframe
    display(airlines.limit(5))
  2. Next, we will write the DataFrame as a delta file in the DBFS. Once the writing process is complete, we can look at the contents of the delta file. It contains four Parquet files and a _delta_log folder:
    airlines.write.mode("overwrite").format("delta").save("dbfs:/airlines/")

    We can view the location where the data is written in delta format:

    display(dbutils.fs.ls("dbfs:/airlines/"))
  3. Inside the _delta_log folder, we can find one JSON file:
    display(dbutils.fs.ls("dbfs:/airlines/_delta_log/"))
  4. Now, we will create a delta table using the delta file that is written to Azure Blob Storage. Here, we will switch from PySpark to Spark SQL syntax using the %sql magic command. The name of the delta table created is airlines_delta_table. A count operation on the newly created delta table returns the number of records in the table:
    %sql
    DROP TABLE IF EXISTS airlines_delta_table;
    CREATE TABLE airlines_delta_table USING DELTA LOCATION "dbfs:/airlines/";
    %sql
    SELECT COUNT(*) as count FROM airlines_delta_table
  5. Let's perform a DELETE operation on the delta table. This will delete all the rows where the Month column equals 10. This deletes 448,620 rows from the delta table:
    %sql
    DELETE FROM airlines_delta_table WHERE Month = '10'
  6. Next, we will perform an UPDATE operation on the delta table. This transaction will update the Dest column and replace all the SFO values with San Francisco. We can also see that 7,575 rows received updates in the table:
    %sql
    UPDATE airlines_delta_table SET Dest = 'San Francisco' WHERE Dest = 'SFO'
  7. Before we move forward, let's look at the Parquet files and transactional logs folder once again. Inside the delta file, we can see that more Parquet files have been added after two transactions (DELETE and UPDATE) were carried out:
    display(dbutils.fs.ls("dbfs:/airlines/"))
  8. Also, the _delta_log folder now contains two more JSON files, one for each transaction:
    display(dbutils.fs.ls("dbfs:/airlines/_delta_log/"))
  9. Finally, it is time for time travel! Running the DESCRIBE HISTORY command on the delta table returns a list of all the versions of the table:
    %sql
    -- Time travel
    DESCRIBE HISTORY airlines_delta_table
  10. Switching to a previous version is as easy as adding VERSION AS OF to the delta table. First, we'll try to query the data based on the condition that got updated. For instance, after the update operation, no record should have the SFO value. Hence, we get a count of 0:
    %sql
    -- Return count of rows where Dest = 'SFO' in current version that is version 2
    SELECT COUNT(*) FROM airlines_delta_table WHERE Dest = 'SFO'
  11. But when the same query is run on the previous version of the delta table (version 1), we get a count of 7,575. This is because this SQL query is querying on the data that existed before the update operation:
    %sql
    -- Return count of rows where Dest = 'SFO' in version 1
    SELECT COUNT(*) FROM airlines_delta_table VERSION AS OF 1 WHERE Dest = 'SFO'

Let's recap what we've covered in this first chapter.

Summary

In this chapter, we learned about the fundamentals of Spark, got an introduction to Databricks, and explored Delta Lake. We were introduced to Azure Databricks and the important workspace components. We learned how to create an Azure Databricks instance, a notebook in the Databricks workspace, Spark clusters, and Databricks jobs. We also learned about the important big data file formats such as Parquet, Avro, and ORC. We also learned about the fundamentals of Delta Lake and went through a worked-out example.

In the next chapter, we will dive deeper into the concepts of batch and stream processing in Azure Databricks. We'll also see more examples in the chapter to practice working hands-on in a Databricks environment.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Understand Spark optimizations for big data workloads and maximizing performance
  • Build efficient big data engineering pipelines with Databricks and Delta Lake
  • Efficiently manage Spark clusters for big data processing

Description

Databricks is an industry-leading, cloud-based platform for data analytics, data science, and data engineering supporting thousands of organizations across the world in their data journey. It is a fast, easy, and collaborative Apache Spark-based big data analytics platform for data science and data engineering in the cloud. In Optimizing Databricks Workloads, you will get started with a brief introduction to Azure Databricks and quickly begin to understand the important optimization techniques. The book covers how to select the optimal Spark cluster configuration for running big data processing and workloads in Databricks, some very useful optimization techniques for Spark DataFrames, best practices for optimizing Delta Lake, and techniques to optimize Spark jobs through Spark core. It contains an opportunity to learn about some of the real-world scenarios where optimizing workloads in Databricks has helped organizations increase performance and save costs across various domains. By the end of this book, you will be prepared with the necessary toolkit to speed up your Spark jobs and process your data more efficiently.

Who is this book for?

This book is for data engineers, data scientists, and cloud architects who have working knowledge of Spark/Databricks and some basic understanding of data engineering principles. Readers will need to have a working knowledge of Python, and some experience of SQL in PySpark and Spark SQL is beneficial.

What you will learn

  • Get to grips with Spark fundamentals and the Databricks platform
  • Process big data using the Spark DataFrame API with Delta Lake
  • Analyze data using graph processing in Databricks
  • Use MLflow to manage machine learning life cycles in Databricks
  • Find out how to choose the right cluster configuration for your workloads
  • Explore file compaction and clustering methods to tune Delta tables
  • Discover advanced optimization techniques to speed up Spark jobs

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Dec 24, 2021
Length: 230 pages
Edition : 1st
Language : English
ISBN-13 : 9781801819077
Vendor :
Databricks , Microsoft
Category :
Languages :
Concepts :

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing

Product Details

Publication date : Dec 24, 2021
Length: 230 pages
Edition : 1st
Language : English
ISBN-13 : 9781801819077
Vendor :
Databricks , Microsoft
Category :
Languages :
Concepts :

Packt Subscriptions

See our plans and pricing
Modal Close icon
€18.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
€189.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts
€264.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total 111.97
Optimizing Databricks Workloads
€32.99
Data Engineering with Apache Spark, Delta Lake, and Lakehouse
€36.99
Azure Databricks Cookbook
€41.99
Total 111.97 Stars icon
Banner background image

Table of Contents

12 Chapters
Section 1: Introduction to Azure Databricks Chevron down icon Chevron up icon
Chapter 1: Discovering Databricks Chevron down icon Chevron up icon
Chapter 2: Batch and Real-Time Processing in Databricks Chevron down icon Chevron up icon
Chapter 3: Learning about Machine Learning and Graph Processing in Databricks Chevron down icon Chevron up icon
Section 2: Optimization Techniques Chevron down icon Chevron up icon
Chapter 4: Managing Spark Clusters Chevron down icon Chevron up icon
Chapter 5: Big Data Analytics Chevron down icon Chevron up icon
Chapter 6: Databricks Delta Lake Chevron down icon Chevron up icon
Chapter 7: Spark Core Chevron down icon Chevron up icon
Section 3: Real-World Scenarios Chevron down icon Chevron up icon
Chapter 8: Case Studies Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.1
(13 Ratings)
5 star 30.8%
4 star 61.5%
3 star 0%
2 star 0%
1 star 7.7%
Filter icon Filter
Top Reviews

Filter reviews by




Subhasish Ghosh Jan 19, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Having read this book, must say, this is one of the best books authored on Azure Databricks in the market today. A few salient points:- Personally, there are very very few people whom I truly, deeply and sincerely admire & respect based on their combination of solid technical expertise, customer centricity + focus & ability to deliver complex projects under pressing timelines having worked across multiple projects across multiple continents, these past 15+ years. Anirudh Kala & Anshul Bhatnagar, both with Celebal Technologies, both of whom I have had the opportunity of working day & night on some supremely complex Azure projects past 2+ years - I can say, first-hand from my personal experience - are both probably the 2 best Technical Experts on Azure Databricks in India, and to write on this topic.- Secondly, I read through the eBook version of the book. It is straight to the point, matter of fact concepts explained, with code to execute wherein relevant.a) Core concepts including Delta Lake, core differences between parquet, avro & orc, Batch ETL process stages, Spark Structured Streaming WALs, MLflow, GraphFrames are well explained.b) Optimizing Databricks strategies and solutions are all excellent to save your precious DBUs (== dollars) and getting the best performance. Loved the section on 'column predicate pushdown' as an optimization technique.c) The chapter on Spark Core Performance Tuning is simply fantastic. Every Databricks Developer/Expert ought to know all these points by heart.- The chapter on Case Studies is excellent. Am very happy and proud to see that a few projects wherein I have had the opportunity of collaborating with the Authors on delivering some solid complex projects are mentioned therein... very proud of some of the work we have done in 2019-2021.- In summary, I strongly recommend getting this book if you're aspiring to or already a professional Data Engineer and/or Data Scientist and/or Data Analyst and/or Data Architects. Big Data is a truth today, and a solid grasp of Databricks, and how to optimize Spark workloads on Databricks for different use-cases efficiently, is a mandatory must.Once again, thank you to the authors for the wonderful effort, and to Packt, Priyanka Mhatre for inviting me to read the eBook and provide a review of the same. Am sure the Big Data community would love this Book.
Amazon Verified review Amazon
Emily Jan 07, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
As a data scientist building a full-stack data pipeline (extraction to analysis), Databricks is my “bread and butter” in terms of tools - I use SQL, Python, and Spark daily. As I write this, I worked through CHs 2, 4, and 7 (data processing topics) and am so glad I got this book. I just built an ELT pipeline for my Databricks workflow and this book was an excellent resource (especially since online documentation is somewhat limited and still growing). I am starting to work through CHs 5 and 3 (ML and analytics topics) as I build the 2nd half of the pipeline. So far it’s very helpful. I like this author’s style of presenting information so have recommended this book and their other books to both data scientists and data engineers.
Amazon Verified review Amazon
A Jan 17, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book gave me whole new prospective in addressing few of the challenges that I was facing. The authors have put a lot of thought into making this book useful to address real world challenges and it's well done. This book also provides detailed insights about the data lake and the spark.
Amazon Verified review Amazon
Kindle Customer Sep 09, 2024
Full star icon Full star icon Full star icon Full star icon Full star icon 5
I recommend this book for beginners or intermediate users, as it not only covers the basics, but tries to go deeper into theory that not everyone knows even if they already use Databricks.
Amazon Verified review Amazon
Vinoth Dec 24, 2021
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
As a Data Engineer, I find this book to be very helpful covering wide array of topics on how to deal with data effectively and efficiently on Azure Databricks. This book covers spark fundamentals, Batch & Realtime processing( structured streaming) ,ML & Graph processing ,Optimization techniques (Managing Spark clusters, Partitioning, Bucketing),newly emerging concept of Delta Lake and Lakehouse, Real world scenarios. You will find this book especially useful if you work in a Databricks environment. It is very detailed with lots of code samples and all the code is on GitHub. Highly recommend it !!
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is included in a Packt subscription? Chevron down icon Chevron up icon

A subscription provides you with full access to view all Packt and licnesed content online, this includes exclusive access to Early Access titles. Depending on the tier chosen you can also earn credits and discounts to use for owning content

How can I cancel my subscription? Chevron down icon Chevron up icon

To cancel your subscription with us simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - From here you will see the ‘cancel subscription’ button in the grey box with your subscription information in.

What are credits? Chevron down icon Chevron up icon

Credits can be earned from reading 40 section of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a Credit every month if you subscribe to our annual or 18 month plans. Credits can be used to buy books DRM free, the same way that you would pay for a book. Your credits can be found in the subscription homepage - subscription.packtpub.com - clicking on ‘the my’ library dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled? Chevron down icon Chevron up icon

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title? Chevron down icon Chevron up icon

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles? Chevron down icon Chevron up icon

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date? Chevron down icon Chevron up icon

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready? Chevron down icon Chevron up icon

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access? Chevron down icon Chevron up icon

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered? Chevron down icon Chevron up icon

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content? Chevron down icon Chevron up icon

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access? Chevron down icon Chevron up icon

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head-start to our content, as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls in to place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.