Partitioning in Spark
In Apache Spark, partitioning is a critical concept: it divides data into chunks that are distributed across the nodes of a cluster for parallel processing. Good partitioning improves data locality, enhances performance, and enables efficient computation by spreading work evenly across executors.
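As a minimal sketch of how data partitioning surfaces in the API (the app name, local master, dataset size, and partition counts below are arbitrary illustrations, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

object PartitioningDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioning-demo") // placeholder app name
      .master("local[4]")           // local mode for illustration only
      .getOrCreate()

    // Split a small dataset into 8 partitions that Spark can
    // process in parallel on different cores/executors.
    val rdd = spark.sparkContext.parallelize(1 to 1000000, 8)
    println(s"Initial partitions: ${rdd.getNumPartitions}") // 8

    // repartition() redistributes data evenly via a full shuffle;
    // coalesce() merges partitions and avoids a full shuffle.
    val wider = rdd.repartition(16)
    val fewer = wider.coalesce(4)
    println(s"After repartition: ${wider.getNumPartitions}") // 16
    println(s"After coalesce: ${fewer.getNumPartitions}")    // 4

    spark.stop()
  }
}
```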
Separately from how data is partitioned, Spark supports two strategies for partitioning cluster resources among applications:
- Static partitioning of resources: Static partitioning is available on all cluster managers. Each application is given a maximum amount of resources it can use, and it holds onto those resources for its whole duration.
- Dynamic sharing of CPU cores: Dynamic sharing is only available on Mesos. Each Spark application still gets a fixed and independent memory allocation, as in static partitioning, but when an application is not running tasks on a machine, other applications may run tasks on those cores. A configuration sketch for both modes follows this list.
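A minimal sketch of how each mode might be configured. The app names, master URL, and resource figures are placeholders, and note that Mesos fine-grained mode (`spark.mesos.coarse=false`) only applies to older Spark releases; it was deprecated in Spark 2.0 and removed in 3.0:

```scala
import org.apache.spark.SparkConf

// Option A: static partitioning (works on all cluster managers).
// The application is granted a fixed maximum and keeps it for its lifetime.
val staticConf = new SparkConf()
  .setAppName("static-resources")     // placeholder app name
  .setMaster("mesos://host:5050")     // placeholder master URL
  .set("spark.cores.max", "8")        // upper bound on total cores
  .set("spark.executor.memory", "4g") // fixed memory per executor

// Option B: dynamic sharing of CPU cores (Mesos fine-grained mode).
// Memory per executor stays fixed, but cores idle between tasks are
// released so other applications can schedule work on them.
val dynamicConf = new SparkConf()
  .setAppName("dynamic-resources")    // placeholder app name
  .setMaster("mesos://host:5050")     // placeholder master URL
  .set("spark.mesos.coarse", "false") // opt out of coarse-grained mode
  .set("spark.executor.memory", "4g") // memory allocation is still static
```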