Designing Spark clusters
Designing a Spark cluster essentially means choosing its configuration. In Databricks, Spark clusters are created and configured from the Compute section. Choosing the right cluster configuration is very important for managing both cost and data across different types of workloads. For example, a cluster that's used concurrently by several data analysts might not be a good fit for Structured Streaming or machine learning workloads.
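To make the idea of a cluster configuration concrete, here is a minimal sketch that creates an all-purpose cluster through the Databricks Clusters REST API (`POST /api/2.0/clusters/create`). The workspace URL, access token, runtime version, node type, and autoscaling bounds are placeholder values rather than recommendations; the same settings can also be chosen interactively in the Compute section.

```python
import requests

# Placeholder workspace URL and personal access token
host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

# Example cluster specification; every value here is illustrative
cluster_spec = {
    "cluster_name": "shared-analyst-cluster",
    "spark_version": "13.3.x-scala2.12",           # Databricks Runtime version
    "node_type_id": "i3.xlarge",                    # worker instance type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,                  # shut down idle clusters to save cost
}

response = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
response.raise_for_status()
print("Created cluster:", response.json()["cluster_id"])
```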
Before we decide on a Spark cluster configuration, several questions need to be asked:

- Who will be the primary user of the cluster? It could be a data engineer, data scientist, data analyst, or machine learning engineer.
- What kind of workloads will run on the cluster? It could be an Extract, Transform, and Load (ETL) process for a data engineer or exploratory data analysis for a data scientist. An ETL process can be further divided into batch and streaming workloads.
- What is the service-level agreement (SLA...