Converting a PySpark dataframe to an array
In order to form the building blocks of the neural network, the PySpark dataframe must be converted into an array. Python has a very powerful library, numpy, that makes working with arrays simple.
Getting ready
The numpy library should already be available with the installation of the anaconda3 Python package. However, if for some reason the numpy library is not available, it can be installed from the terminal with pip install numpy (or sudo pip install numpy). If the library is already installed, pip reports that the requirement is already satisfied. Once numpy is available, import it:
import numpy as np
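As a minimal sketch, one way to check whether numpy can be imported before attempting an installation is to look it up with the standard library's importlib. The variable name numpy_available is an illustrative assumption, not part of the original recipe:

```python
import importlib.util

# find_spec returns None when the package cannot be located on the import path
numpy_available = importlib.util.find_spec("numpy") is not None

if numpy_available:
    import numpy as np
    print("numpy version:", np.__version__)
else:
    # Assumption: pip is the package manager in use
    print("numpy not found; install it with: pip install numpy")
```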
How to do it...
This section walks through the steps to convert the dataframe into an array:
- View the data collected from the dataframe using the following script:
df.select("height", "weight", "gender").collect()
- Store the values from the collection into an array called data_array using the following script:
data_array = np.array(df.select("height", "weight", "gender").collect())
- Execute...
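The conversion in the steps above can be tried without a running Spark session. The sketch below assumes that collect() returns a list of Row objects, which behave like tuples, and substitutes a hypothetical namedtuple stand-in with made-up sample values, so the numbers are not from the original dataframe:

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for pyspark.sql.Row, which is also a tuple subclass
Row = namedtuple("Row", ["height", "weight", "gender"])

# Made-up sample rows, shaped like the result of
# df.select("height", "weight", "gender").collect()
collected = [Row(67, 150, 0), Row(70, 180, 1), Row(64, 125, 0)]

# np.array treats each row as a sequence, giving a 2-D array:
# one row per record, one column per selected field
data_array = np.array(collected)

print(data_array.shape)
print(data_array[0])
```

Note that numpy picks a single dtype for the whole array. If any selected column holds strings (for example, an unencoded gender column), every value in the array would be stored as a string, so it is likely worth confirming that all selected columns are numeric before converting.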