As discussed in Chapter 1, Introduction to Apache Spark, Spark can scale horizontally: performance improves as you add nodes to the cluster, because Spark can run more operations in parallel. Spark also makes heavy use of memory, and a fast network speeds up shuffling data between nodes. For all of these reasons, adding more hardware generally improves performance.
Cluster-level optimizations
Memory
Efficient use of memory is critical for good performance. In earlier versions of Spark, memory was used for three main purposes:
- RDD storage
- Shuffle and aggregation storage
- User code
Memory was divided among these uses in fixed proportions. By default, RDD storage received 60% of the heap, shuffle and aggregation storage received 20%, and the remaining 20% was left for user code.
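As a rough sketch of how this static split was expressed in configuration, the following Scala snippet sets the legacy fraction properties on a SparkConf. It assumes a Spark 1.6 to 2.x build, where the old fractions only take effect when spark.memory.useLegacyMode is enabled (before Spark 1.6 this static split was the default behavior; later releases removed these settings in favor of the unified memory manager):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative configuration reproducing the old fixed memory split.
// These keys only apply when legacy mode is enabled (Spark 1.6-2.x).
val conf = new SparkConf()
  .setAppName("legacy-memory-fractions")
  .setMaster("local[*]")
  .set("spark.memory.useLegacyMode", "true")   // opt back into the static model
  .set("spark.storage.memoryFraction", "0.6")  // ~60% of the heap for cached RDDs
  .set("spark.shuffle.memoryFraction", "0.2")  // ~20% for shuffle and aggregation buffers
  // the remaining ~20% is left for user code

val sc = new SparkContext(conf)
```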