Converting a PySpark DataFrame to a Pandas DataFrame
At various times in your workflow, you will want to switch from a PySpark DataFrame to a Pandas DataFrame. PySpark provides the toPandas() method for this conversion.
One thing to note here is that Pandas, unlike PySpark, is not distributed: a Pandas DataFrame lives entirely in the memory of a single machine. Therefore, when a PySpark DataFrame is converted to Pandas, all of its data is collected into the driver’s memory. We need to make sure that the driver has enough memory to hold the data. If the data does not fit in the driver’s memory, the conversion fails with an out-of-memory error.
Here’s how to convert a PySpark DataFrame to a Pandas DataFrame:
data_df.toPandas()
As a result, you will see a DataFrame with our specified columns and their data types:
(Output: the contents of data_df displayed as a Pandas DataFrame.)
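Because toPandas() pulls every row onto the driver, it can help to check the size of the data before converting. The following is a minimal sketch of one way to do that, not a built-in PySpark feature: the helper name to_pandas_capped and the max_rows threshold are hypothetical, chosen for illustration, and row count is only a rough proxy for memory use since wide rows cost more than narrow ones.

```python
def to_pandas_capped(spark_df, max_rows=100_000):
    """Convert a PySpark DataFrame to Pandas, refusing oversized inputs.

    to_pandas_capped and max_rows are illustrative names, not part of PySpark.
    """
    # Counting at most max_rows + 1 rows is enough to tell whether the
    # DataFrame exceeds the cap, without counting the full dataset.
    row_count = spark_df.limit(max_rows + 1).count()
    if row_count > max_rows:
        raise ValueError(
            f"DataFrame has more than {max_rows} rows; "
            "filter, aggregate, or sample it before calling toPandas()."
        )
    # Safe (by the row-count heuristic) to collect onto the driver.
    return spark_df.toPandas()
```

With the data_df from the example above, the call would be to_pandas_capped(data_df). Note that the check triggers a Spark job of its own, so the DataFrame is evaluated twice; for expensive pipelines, caching the DataFrame first may be worthwhile.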