Creating a dataframe in PySpark
Dataframes will serve as the framework for all of the data used in building deep learning models. Similar to the pandas library in Python, PySpark has its own built-in functionality for creating a dataframe.
Getting ready
There are several ways to create a dataframe in Spark. One common way is to import a .txt, .csv, or .json file. Another is to manually enter fields and rows of data into the PySpark dataframe; the process can be a bit tedious, but it is helpful, especially when dealing with a small dataset. To predict gender based on height and weight, this chapter will build a dataframe manually in PySpark. The dataset used is as follows:
Although the dataset is added to PySpark manually in this chapter, it can also be viewed and downloaded from the following link:
Finally, we will begin this chapter and future chapters...
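The two approaches described in this section, entering rows manually and importing a file, can be sketched roughly as follows. This is a minimal illustration rather than the chapter's actual code: the rows are hypothetical placeholder values and not the chapter's dataset, the helper names (get_spark, build_manually, load_from_csv) are invented for this example, and a local Spark installation is assumed.

```python
# Column names follow the chapter's goal of predicting gender from height and weight.
columns = ["height", "weight", "gender"]

# Hypothetical placeholder rows (NOT the chapter's dataset): (height, weight, gender).
rows = [
    (67, 150, "Male"),
    (62, 120, "Female"),
    (70, 175, "Male"),
]


def get_spark(app_name="dataframe-sketch"):
    """Start (or reuse) a local Spark session. The app name is arbitrary."""
    # Imported lazily so the data definitions above load even without Spark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()


def build_manually(spark, rows, columns):
    """Create a dataframe from in-memory tuples, naming the columns explicitly."""
    return spark.createDataFrame(rows, schema=columns)


def load_from_csv(spark, path):
    """Create a dataframe from a .csv file whose first line is a header row."""
    return spark.read.csv(path, header=True, inferSchema=True)
```

In a notebook, df = build_manually(get_spark(), rows, columns) followed by df.show() would display the rows, and df.printSchema() would list the column types Spark inferred. The reader object also offers spark.read.json and spark.read.text for the other file types mentioned above.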