Understanding the collect() method
Spark's collect() function is an action that retrieves all the elements of a Resilient Distributed Dataset (RDD) or DataFrame to the driver. We will first take a look at an example of using the function. Run the following code block:
from pyspark.sql.functions import *

airlines_1987_to_2008 = (
    spark
    .read
    .option("header", True)
    .option("delimiter", ",")
    .option("inferSchema", True)
    .csv("dbfs:/databricks-datasets/asa/airlines/*")
)

display(airlines_1987_to_2008)
The preceding code block creates a Spark DataFrame and displays the first 1,000 records. Now, let's run some code with the collect() function:
airlines_1987_to_2008.select('Year').distinct().collect()
The preceding line of code returns a list of row objects for the distinct Year column values. A row object is a collection of fields that can be iterated...