Creating DataFrames
As we have already discussed, DataFrames are the main building blocks of data in Spark. They consist of row and column data structures.
DataFrames in PySpark are created with the pyspark.sql.SparkSession.createDataFrame method. You can pass it lists, lists of lists, tuples, dictionaries, pandas DataFrames, RDDs, or pyspark.sql.Row objects to create a DataFrame.
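As a minimal sketch, assuming a local SparkSession (the column names and sample values here are purely illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create-df-example").getOrCreate()

# A list of tuples; each tuple becomes one row of the DataFrame.
data = [("Alice", 34), ("Bob", 29)]

# When only names are needed, they can be passed as a simple list.
df = spark.createDataFrame(data, ["name", "age"])
df.show()
```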
createDataFrame also accepts an argument named schema that specifies the schema of the resulting DataFrame. You can either specify the schema explicitly or let Spark infer it from the data itself; if you omit the argument, Spark infers the schema on its own.
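The following sketch contrasts the two approaches, reusing the spark session and data list from the previous example:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema: Spark skips inference and enforces these types.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df_explicit = spark.createDataFrame(data, schema=schema)

# No schema argument: Spark samples the data and infers the types itself.
df_inferred = spark.createDataFrame(data)

df_explicit.printSchema()
```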
There are different ways to create DataFrames in Spark. Some of them are explained in the following sections.
Using a list of rows
The first way to create a DataFrame that we will look at uses rows of data. You can think of each row as a list of values that shares a common set of field names, the column headers, with every other row.
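A short sketch using pyspark.sql.Row, again assuming the spark session created earlier (field names and values are illustrative):

```python
from pyspark.sql import Row

# Each Row shares the same field names, which become the column names.
rows = [
    Row(name="Alice", age=34),
    Row(name="Bob", age=29),
]

df_rows = spark.createDataFrame(rows)
df_rows.show()
```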