Inferring the schema using reflection
DataFrames have schemas; RDDs don't — unless the RDD is composed of Row(...)
objects.
In this recipe, we will learn how to create DataFrames by inferring the schema using reflection.
Getting ready
To execute this recipe, you need to have a working Spark 2.3 environment.
There are no other requirements.
How to do it...
In this example, we will first read our CSV sample data into an RDD, convert each record into a Row(...) object, and then create a DataFrame from it. Here's the code:
import pyspark.sql as sql

sample_data_rdd = sc.textFile('../Data/DataFrames_sample.csv')
header = sample_data_rdd.first()

sample_data_rdd_row = (
    sample_data_rdd
    .filter(lambda row: row != header)
    .map(lambda row: row.split(','))
    .map(lambda row: sql.Row(
        Id=int(row[0])
        , Model=row[1]
        , Year=int(row[2])
        , ScreenSize=row[3]
        , RAM=row[4]
        , HDD=row[5]
        , W=float(row[6])
        , D=float(row[7])
        ...
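The per-record parsing done inside the map calls above — dropping the header, splitting on commas, and casting each field to the type we want Spark to infer — can be sketched in plain Python, without a Spark cluster. The header and sample line below are hypothetical, and only the fields shown in the recipe are included:

```python
# A minimal sketch of the recipe's row-parsing logic in plain Python.
# The CSV content here is made up for illustration; field names and
# types mirror the Row(...) call in the recipe.

def parse_row(line):
    """Split one CSV line and cast each field to its intended type."""
    row = line.split(',')
    return {
        'Id': int(row[0]),
        'Model': row[1],
        'Year': int(row[2]),
        'ScreenSize': row[3],
        'RAM': row[4],
        'HDD': row[5],
        'W': float(row[6]),
        'D': float(row[7]),
    }

header = 'Id,Model,Year,ScreenSize,RAM,HDD,W,D'
lines = [
    header,
    '1,MacBook Pro,2015,13.3in,8 GB,512 GB SSD,13.75,9.48',
]

# Same shape as the RDD pipeline: filter out the header, then parse.
records = [parse_row(line) for line in lines if line != header]
```

Because each field is explicitly cast (int, float, or left as a string), Spark can later read the Python types off the Row objects and infer the DataFrame schema by reflection — that is the whole point of the recipe.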