In the previous chapters, we learned how to use Spark to implement a variety of use cases using features such as RDDs, DataFrames, Spark SQL, MLlib, GraphX/GraphFrames, and Spark Streaming. We also discussed how to monitor your applications to better understand their behavior in production. However, you will often want your jobs to run more efficiently. We measure the efficiency of a job on two parameters: runtime and storage space. In a Spark application, you might also be interested in statistics about the data shuffled between nodes. We discussed some optimizations in the earlier chapters, but, in this chapter, we'll discuss more optimizations that can help you achieve further performance benefits.
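As a concrete illustration of measuring shuffle activity, the following is a minimal sketch that registers a `SparkListener` to log shuffle read/write bytes as tasks complete. It assumes a local `SparkSession`; the listener and the `groupBy` job here are illustrative examples, not code from the earlier chapters:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ShuffleMetricsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation; in production the
    // master is set by your cluster manager, not in code.
    val spark = SparkSession.builder()
      .appName("shuffle-metrics-sketch")
      .master("local[*]")
      .getOrCreate()

    // Log per-task shuffle metrics as each task finishes.
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          println(
            s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
            s"shuffleReadBytes=${metrics.shuffleReadMetrics.totalBytesRead} " +
            s"shuffleWriteBytes=${metrics.shuffleWriteMetrics.bytesWritten}")
        }
      }
    })

    // A groupBy aggregation forces a shuffle, so the listener has
    // something to report.
    spark.range(0, 1000000)
      .withColumn("key", col("id") % 10)
      .groupBy("key").count()
      .collect()

    spark.stop()
  }
}
```

The same numbers are visible in the Spark UI under each stage's shuffle read/write columns; a listener is simply a programmatic way to capture them.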
Most developers concentrate on writing their Spark applications and, for a variety of reasons, do not invest in optimizing their jobs. This chapter...