Joining DataFrames together
So far, we have only considered operations on a single DataFrame. Spark also offers SQL-like joins to combine DataFrames. Let's assume that we have another DataFrame mapping each patient ID to a (systolic) blood pressure measurement, and that the data starts out as a list of pairs:
scala> val bloodPressures = List((1 -> 110), (3 -> 100), (4 -> 125))
bloodPressures: List[(Int, Int)] = List((1,110), (3,100), (4,125))

scala> val bloodPressureRDD = sc.parallelize(bloodPressures)
bloodPressureRDD: rdd.RDD[(Int, Int)] = ParallelCollectionRDD[74] at parallelize at <console>:24
We can construct a DataFrame from this RDD of tuples. However, unlike when constructing DataFrames from RDDs of case classes, Spark cannot infer the column names. We must therefore pass these explicitly to .toDF:
scala> val bloodPressureDF = bloodPressureRDD.toDF(
  "patientId", "bloodPressure")
bloodPressureDF: DataFrame...
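For contrast, here is a minimal sketch of the case-class route mentioned above, in which Spark derives the column names from the case class's field names. The BloodPressure case class is hypothetical, introduced here only for illustration:

scala> case class BloodPressure(patientId: Int, bloodPressure: Int) // hypothetical
defined class BloodPressure

scala> val bpDF = sc.parallelize(bloodPressures.map {
  case (id, bp) => BloodPressure(id, bp)
}).toDF() // no column names needed: inferred from the field names
bpDF: DataFrame...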
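With both DataFrames sharing a patientId column, we can now combine them. The following is a minimal sketch, assuming a readingsDF DataFrame from earlier in the chapter that also carries a patientId column; readingsDF is an assumption here, not defined in this section:

scala> // assumes readingsDF, a DataFrame with a patientId column
scala> val joinedDF = readingsDF.join(bloodPressureDF, "patientId")
joinedDF: DataFrame...

Calling joinedDF.show() would then display one row per patient present in both DataFrames, with the columns of readingsDF followed by bloodPressure. This form of join performs an inner join; other join types ("leftouter", "outer", and so on) are available through overloads of join.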