Summary
In this chapter, we learned several useful techniques for optimizing Spark jobs that work with Spark DataFrames. We began with the collect()
method and when to avoid it, and ended with a discussion of SQL optimization best practices and bucketing. We also saw why data scientists using Databricks should adopt Parquet files and Koalas.
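As a quick refresher, the following minimal sketch revisits two of these techniques, the collect() anti-pattern and bucketed Parquet writes. The table and column names (user_id, bucketed_users) are hypothetical and are not the chapter's exact examples:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("optimization-refresher").getOrCreate()

# A hypothetical DataFrame with one million rows and a user_id column.
df = spark.range(1_000_000).withColumnRenamed("id", "user_id")

# Avoid collect() on large DataFrames: it pulls every row back to the
# driver and can cause out-of-memory errors.
# rows = df.collect()              # anti-pattern at scale

# Prefer bounded retrieval, keeping the heavy lifting on the executors.
preview = df.limit(10).collect()   # only 10 rows reach the driver

# Bucketing pre-partitions data on a join/aggregation key at write time,
# so later queries on user_id can avoid a full shuffle. Parquet gives us
# a compressed, columnar, splittable storage format.
(df.write
   .format("parquet")
   .bucketBy(8, "user_id")
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("bucketed_users"))

# Koalas offers pandas-style syntax with distributed execution:
# import databricks.koalas as ks
# kdf = ks.DataFrame(df)

Note that bucketBy() only works with saveAsTable(), because the bucketing metadata is stored in the metastore; a plain path-based save() does not support it.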
In the next chapter, we will learn about some of the most powerful optimization techniques offered by Delta Lake. We will develop a theoretical understanding of these optimizations and write code to understand their practical use in different scenarios.