Creating RDDs
There are two ways to create an RDD in PySpark: you can either `.parallelize(...)` a collection (a list or an array of some elements):
data = sc.parallelize([('Amber', 22), ('Alfred', 23), ('Skye', 4), ('Albert', 12), ('Amber', 9)])
Or you can reference a file (or files) located either locally or somewhere externally:
data_from_file = sc.textFile('/Users/drabast/Documents/PySpark_Data/VS14MORT.txt.gz', 4)
Note
We downloaded the Mortality dataset VS14MORT.txt file from ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/mort2014us.zip (accessed on July 31, 2016); the record schema is explained in this document: http://www.cdc.gov/nchs/data/dvs/Record_Layout_2014.pdf. We selected this dataset on purpose: the encoding of the records will help us explain how to use UDFs to transform your data later in this chapter. For your convenience, we also host the file here: http://tomdrabas.com/data/VS14MORT.txt.gz
The last parameter in sc.textFile...