Building a data model in Delta Lake and data pipeline jobs with Databricks
Apache Spark is a well-known framework that is widely used for big data ETL/ELT jobs and machine learning tasks. ADF allows us to utilize its capabilities in two different ways:
- Running Spark in an HDInsight cluster
- Running Databricks notebooks, JAR files, and Python files
Running Spark in an HDInsight cluster is very similar to the previous recipe, so we will concentrate on the Databricks service. Databricks also allows running interactive notebooks, which significantly simplifies the development of ETL/ELT pipelines and machine learning tasks.

In this recipe, we will connect Azure Data Lake Storage to Databricks, ingest the MovieLens dataset, transform the data, and store the resulting dataset as a Delta table in Azure Data Lake Storage, as sketched below.
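The following is a minimal PySpark sketch of the flow this recipe builds in a Databricks notebook (where the `spark` session is predefined). It assumes the MovieLens ratings file has been uploaded to an ADLS Gen2 container and that the cluster is already configured to access the storage account; the storage account name, container name, and file paths are placeholders, not values from this recipe.

```python
from pyspark.sql import functions as F

# Placeholder paths: replace the container and <storage-account> with your own.
source_path = "abfss://movielens@<storage-account>.dfs.core.windows.net/ratings.csv"
target_path = "abfss://movielens@<storage-account>.dfs.core.windows.net/delta/avg_ratings"

# Ingest the raw CSV, reading the header row and inferring column types
ratings = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv(source_path))

# An illustrative transformation: average rating and rating count per movie
avg_ratings = (ratings.groupBy("movieId")
               .agg(F.avg("rating").alias("avg_rating"),
                    F.count("rating").alias("num_ratings")))

# Persist the result as a Delta table in the data lake
avg_ratings.write.format("delta").mode("overwrite").save(target_path)
```

The actual recipe steps below walk through the storage configuration and transformations in detail; this sketch only previews the overall ingest-transform-write shape of the pipeline.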
Getting ready
First, log in to your Microsoft Azure account. We assume you have a pre-configured resource group and storage account with Azure Data Lake Storage Gen2 and the Azure Databricks...