Understanding Spark SQL optimizations
In this section, we will learn about how to write efficient Spark SQL queries, along with tips to help optimize the existing SQL queries:
- Avoid using
NOT IN
in the SQL queries, as it is a very expensive operation. - Filter the data before performing join operations by using the
WHERE
clause before joining the tables. - Mention the column name when using the
SELECT
clause instead of giving a*
to select all of them. Try to use the columns required for operations instead of selecting all of them unnecessarily. - Avoid using
LIKE
in theWHERE
clause, as it is another expensive operation. - Try not to join the same set of tables multiple times. Instead, write a common table expression (CTE) using the
WITH
clause to create a subquery, and use it to join the tables wherever necessary. - When joining the same table for different conditions, use the
CASE
statements.
In the next and final section of this chapter, we will learn about...