Accessing underlying RDDs
Switching to DataFrames does not mean we need to abandon RDDs completely. Under the hood, DataFrames still use RDDs, but these are RDDs of Row(...)
objects, as explained earlier. In this recipe, we will learn how to interact with the underlying RDD of a DataFrame.
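As a quick illustration (a minimal sketch, not part of the recipe, assuming an active SparkSession named spark and a throwaway DataFrame), the .rdd attribute of any DataFrame exposes that underlying RDD of Row objects:

# Minimal sketch: `spark` is assumed to be an active SparkSession.
df = spark.createDataFrame([(1, 'laptop'), (2, 'desktop')], ['Id', 'Model'])

print(type(df.rdd))    # the DataFrame is backed by a plain RDD...
print(df.rdd.first())  # ...whose elements are Row objects, e.g. Row(Id=1, Model='laptop')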
Getting ready
To execute this recipe, you need to have a working Spark 2.3 environment. Also, you should have already gone through the previous recipe, as we will reuse the data we created there.
There are no other requirements.
How to do it...
In this example, we will extract the size of the HDD and its type into separate columns, and will then calculate the minimum volume needed to put each computer in boxes:
import pyspark.sql as sql
import pyspark.sql.functions as f

sample_data_transformed = (
    sample_data_df
    .rdd
    .map(lambda row: sql.Row(
        **row.asDict()
        , HDD_size=row.HDD.split(' ')[0]  # first token of the HDD string: the size
        )
    )
    .map(lambda row: sql.Row(
        **row.asDict()
        , HDD_type=row.HDD.split(' ')[1]  # second token of the HDD string: the type
        )
    )
)
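The snippet above only adds the HDD_size and HDD_type columns; the recipe also calls for the box volume of each computer. A rough sketch of how that step might look, assuming the sample data created in the previous recipe stores the width, depth, and height in columns named W, D, and H (these names are an assumption, not taken from this excerpt), is to add one more .map(...), convert the RDD of Row objects back into a DataFrame, and, for example, round the volume with the functions module imported as f:

# Sketch only: W, D and H are assumed column names for the computer dimensions.
sample_data_transformed = (
    sample_data_transformed
    .map(lambda row: sql.Row(
        **row.asDict()
        , Volume=row.H * row.D * row.W  # minimum box volume for this computer
        )
    )
    .toDF()  # back to the DataFrame API
    .select(
        sample_data_df.columns
        + [
            'HDD_size'
            , 'HDD_type'
            , f.round(f.col('Volume')).alias('Volume')  # round to whole units
        ]
    )
)

sample_data_transformed.show()

Note that dropping down to .rdd and returning with .toDF() is heavier than staying in the DataFrame API; the same columns could also be derived with built-in functions such as f.split(). The point of this recipe is simply to show that the round trip is possible.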