Learning partitioning strategies in Spark
In this section, we will discuss some useful strategies for managing Spark partitions and Apache Hive partitions. Whenever Spark processes data in memory, it breaks that data down into partitions, and each partition is processed as a task running on an executor core. These are the Spark partitions. On the other hand, Hive partitions organize persisted tables on disk into parts based on column values.
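To make the distinction concrete, here is a minimal PySpark sketch; the DataFrame, the year column, and the output path are illustrative assumptions rather than examples from this chapter:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

# Spark partitions: created automatically for data held in memory;
# each partition is processed as one task on an executor core.
df = spark.range(0, 1000000)
print(df.rdd.getNumPartitions())  # number of in-memory Spark partitions

# Hive-style partitions: organize the persisted output on disk by column
# values; the "year" column and the output path are hypothetical.
sales = df.withColumn("year", (df.id % 3 + 2021).cast("int"))
sales.write.mode("overwrite").partitionBy("year").parquet("/tmp/sales_by_year")

The in-memory partition count depends on the data source and cluster configuration, while the partitionBy call produces one directory per distinct year value, which is the same column-based layout Hive uses for partitioned tables.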
Understanding Spark partitions
Before we learn about the strategies for managing Spark partitions, we need to know how to find the number of partitions of any given DataFrame:
- To check the number of Spark partitions of a given DataFrame, we use the following syntax:
dataframe.rdd.getNumPartitions()
Also, remember that the total number of tasks doing work on a Spark DataFrame is equal to the total number of partitions of that DataFrame.
- Next, we will learn how to check the number of records in each Spark partition (see the sketch after this list). We will begin by re-creating the airlines DataFrame...
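As a combined, runnable sketch of both checks, assuming a small hand-built stand-in for the airlines DataFrame (the real dataset and its schema are not shown in this excerpt): spark_partition_id from pyspark.sql.functions tags each row with the ID of the partition that holds it, so grouping on that ID yields the per-partition record counts.

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.appName("partition-counts").getOrCreate()

# Hypothetical stand-in for the airlines DataFrame used in this section.
airlines = spark.createDataFrame(
    [(1, "AA"), (2, "DL"), (3, "UA"), (4, "WN")],
    ["flight_id", "carrier"],
)

# The total number of tasks equals the total number of partitions.
print(airlines.rdd.getNumPartitions())

# Count the records that landed in each Spark partition.
(airlines
 .withColumn("partition_id", spark_partition_id())
 .groupBy("partition_id")
 .count()
 .orderBy("partition_id")
 .show())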