Loading and transforming the data
Even though MLlib is designed with RDDs and DStreams in mind, for ease of transforming the data we will read it in and convert it to a DataFrame.
Note
DStreams are the basic data abstraction for Spark Streaming (see http://bit.ly/2jIDT2A).
Just like in the previous chapter, we first specify the schema of our dataset.
Note
Note that here, for brevity, we only present a handful of features. You should always check the GitHub repository for this book for the latest version of the code: https://github.com/drabastomek/learningPySpark.
Here's the code:
import pyspark.sql.types as typ

labels = [
    ('INFANT_ALIVE_AT_REPORT', typ.StringType()),
    ('BIRTH_YEAR', typ.IntegerType()),
    ('BIRTH_MONTH', typ.IntegerType()),
    ('BIRTH_PLACE', typ.StringType()),
    ('MOTHER_AGE_YEARS', typ.IntegerType()),
    ('MOTHER_RACE_6CODE', typ.StringType()),
    ('MOTHER_EDUCATION', typ.StringType()),
    ('FATHER_COMBINED_AGE', typ.IntegerType()),
    ('FATHER_EDUCATION'...
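With the labels list in place, the schema can be assembled and passed to the reader. The following is a minimal sketch of that step: the file name births_train.csv.gz is a placeholder for wherever your copy of the dataset lives, and spark is assumed to be an existing SparkSession.

# Build a StructType from the (name, type) pairs in labels;
# the final False marks each column as non-nullable.
schema = typ.StructType([
    typ.StructField(name, dtype, False)
    for name, dtype in labels
])

# Hypothetical file name -- substitute the path to your copy of the data.
births = spark.read.csv(
    'births_train.csv.gz',
    header=True,
    schema=schema
)

Reading with an explicit schema avoids a second pass over the file to infer types and guarantees that every column arrives with the type the downstream transformations expect.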