Collecting the data
A collect statement is used when we want to bring all the data that is being processed across the executors of a cluster back to the driver. The result is held entirely in the driver's memory, so we need to make sure that the driver has enough memory to hold the processed data. If it doesn’t, we will get out-of-memory errors.
This is how you write the collect statement:
data_df.collect()
This statement returns the result as a list of Row objects, as follows:
[Row(col_1=100, col_2=200.0, col_3='string_test_1', col_4=datetime.date(2023, 1, 1), col_5=datetime.datetime(2023, 1, 1, 12, 0)), Row(col_1=200, col_2=300.0, col_3='string_test_2', col_4=datetime.date(2023, 2, 1), col_5=datetime.datetime(2023, 1, 2, 12, 0)), Row(col_1=300, col_2=400.0, col_3='string_test_3', col_4=datetime.date(2023, 3, 1), col_5=datetime.datetime(2023, 1, 3, 12, 0))]
There are a few ways to avoid out-of-memory errors. We will explore some of the options that...