Chapter 9: Batch and Streaming Data Processing with Azure Databricks
Databricks is a data engineering product built on top of Apache Spark. It provides a unified, cloud-optimized platform for performing ETL, machine learning, and AI tasks on large volumes of data.
Azure Databricks, as its name suggests, is the Databricks integration with Azure. It provides fully managed Spark clusters, an interactive workspace for data visualization and exploration, and integration with Azure Data Factory and with data sources such as Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, Azure SQL Data Warehouse, and more.
Azure Databricks can process data from multiple, diverse data sources, whether SQL or NoSQL, structured or unstructured, and can scale out to as many servers as required to accommodate data growth, even when that growth is exponential.
In this chapter, we'll cover the following recipes:
- Configuring the Azure Databricks environment
- Transforming data using...