Exploring Spark DataFrames
One of the major advantages that Spark DataFrames offer over traditional RDDs is how easy the data is to use and explore. A DataFrame stores data in a structured, tabular format, which makes the data easier to make sense of. We can compute basic properties such as the number of rows and columns, inspect the schema, and compute summary statistics such as the mean and standard deviation.
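As a rough illustration of the kind of exploration described above, the following minimal sketch wraps a few of these lookups in a small helper. The names basic_overview and demo, as well as the toy data, are hypothetical and are not part of this chapter's dataset or of the Spark API; the sketch only assumes that the object passed in behaves like a Spark DataFrame (it exposes columns and count()).

```python
def basic_overview(df):
    """Return (row_count, column_count, column_names) for a DataFrame-like object.

    Hypothetical helper for illustration; it relies only on `df.columns`
    and `df.count()`, both of which a Spark DataFrame provides.
    """
    names = df.columns  # list of column names, in schema order
    return df.count(), len(names), names


def demo():
    # Requires PySpark; imported here so the helper above stays usable without it.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("overview-sketch")
        .getOrCreate()
    )
    # Hypothetical toy data, not the dataset used in this chapter.
    df = spark.createDataFrame([(1, 2.0), (2, 4.0), (3, 6.0)], ["id", "value"])

    df.printSchema()            # schema shown as a tree
    print(basic_overview(df))   # row count, column count, column names
    df.describe().show()        # count, mean, stddev, min, max per column
    spark.stop()
```

Calling demo() in an environment where PySpark is installed runs the whole sketch against a local Spark session. Note that count() triggers a Spark job, so on large datasets it can be noticeably slower than reading the schema.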
Exercise 28: Displaying Basic DataFrame Statistics
In this exercise, we will display basic DataFrame statistics, the first few rows of the data, and summary statistics for all the numerical DataFrame columns and for an individual DataFrame column:
Look at the DataFrame schema. The schema is displayed in a tree format on the console:
df.printSchema()
Now, use the following command to print the column names of the Spark DataFrame:
df.schema.names
To retrieve the number of rows and columns present in the Spark DataFrame, use...