DataFrames – a whirlwind introduction
Let's start by opening a Spark shell:
$ spark-shell
Let's imagine that we are interested in running analytics on a set of patients to estimate their overall health level. We have measured, for each patient, their height, weight, age, and whether they smoke.
We might represent the readings for each patient as a case class (you might wish to write some of this in a text editor and paste it into the Scala shell using :paste):
scala> case class PatientReadings(
  val patientId: Int,
  val heightCm: Int,
  val weightKg: Int,
  val age: Int,
  val isSmoker: Boolean
)
defined class PatientReadings
We would, typically, have many thousands of patients, possibly stored in a database or a CSV file. We will worry about how to interact with external sources later in this chapter. For now, let's just hard-code a few readings directly in the shell:
scala> val readings = List( PatientReadings(1, 175, 72, 43, false), PatientReadings...
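Before we bring Spark into the picture, it can help to see the kind of per-patient analytics we have in mind in plain Scala. The sketch below computes body-mass index (weight in kilograms divided by height in metres squared) for each reading; BMI as a stand-in health metric, and the second patient's values, are assumptions for illustration, not taken from the text:

```scala
// Plain-Scala sketch of a simple per-patient health metric.
// BMI here is an illustrative assumption; the sample data is made up.
case class PatientReadings(
  val patientId: Int,
  val heightCm: Int,
  val weightKg: Int,
  val age: Int,
  val isSmoker: Boolean
)

object BmiSketch {
  // BMI = weight (kg) / height (m) squared
  def bmi(r: PatientReadings): Double = {
    val heightM = r.heightCm / 100.0
    r.weightKg / (heightM * heightM)
  }

  def main(args: Array[String]): Unit = {
    val readings = List(
      PatientReadings(1, 175, 72, 43, false),
      PatientReadings(2, 182, 78, 28, true) // hypothetical second patient
    )
    readings.foreach { r =>
      println(f"patient ${r.patientId}: BMI = ${bmi(r)}%.1f")
    }
  }
}
```

Once the readings live in a DataFrame rather than a plain List, the same computation becomes a column expression rather than a Scala function, which is what lets Spark distribute it across a cluster.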