Processing big data with Apache Spark
Apache Spark is a well-known distributed processing framework that is often used for ETL/ELT jobs and machine learning tasks on big data. ADF lets us utilize its capabilities in two different ways:
- Running Spark in an HDInsight cluster
- Running Databricks notebooks, JAR files, and Python files
Running Spark in an HDInsight cluster is very similar to the previous recipe, so we will concentrate on the Azure Databricks service instead. Databricks also supports interactive notebooks, which significantly simplify the development of ETL/ELT pipelines and machine learning tasks.
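To give a flavor of what such a notebook contains, here is a minimal PySpark sketch of an ETL step: it reads raw CSV files from Azure Data Lake Storage Gen2, aggregates them, and writes the result back as Parquet. The storage account name (mydatalake), the container names (raw, curated), and the column names are placeholders, and the sketch assumes the cluster can already reach the storage account (one way to configure that is shown in the Getting ready section):

```python
from pyspark.sql import functions as F

# In a Databricks notebook, `spark` (a SparkSession) is predefined.
# Placeholder paths -- substitute your own storage account and containers.
raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/sales/*.csv"
curated_path = "abfss://curated@mydatalake.dfs.core.windows.net/sales_daily"

# Extract: read the raw CSV files, inferring the schema.
sales = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv(raw_path))

# Transform: compute daily revenue (column names are hypothetical).
daily = (sales
         .withColumn("order_date", F.to_date("order_timestamp"))
         .groupBy("order_date")
         .agg(F.sum("amount").alias("revenue")))

# Load: write the aggregated result back to the lake as Parquet.
daily.write.mode("overwrite").parquet(curated_path)
```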
Getting ready
Assuming you have a preconfigured resource group and a storage account with Azure Data Lake Storage Gen2 enabled, log in to your Microsoft Azure account. To run Databricks notebooks, you have to switch to a pay-as-you-go subscription, as free trial subscriptions do not provide enough resources to create a Databricks cluster.
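Before a notebook can read from or write to the lake, the cluster needs access to the storage account. The following is a minimal sketch of one common approach: putting the storage account key into the Spark session configuration. The secret scope name (adf-cookbook), the secret name (storage-account-key), and the storage account name (mydatalake) are placeholders for your own values:

```python
# `spark` and `dbutils` are predefined in a Databricks notebook.
# Fetch the storage account key from a Databricks secret scope
# (placeholder scope/key names) and hand it to the ABFS driver.
spark.conf.set(
    "fs.azure.account.key.mydatalake.dfs.core.windows.net",
    dbutils.secrets.get(scope="adf-cookbook", key="storage-account-key"),
)

# Quick check: list the root of the raw container.
display(dbutils.fs.ls("abfss://raw@mydatalake.dfs.core.windows.net/"))
```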
How to do it…
- Go to the Azure portal and search for Azure Databricks.
- Click + Add and fill in the project details.
- Select your subscription...