Transformations
Transformations shape your dataset. These include mapping, filtering, joining, and transcoding the values in your dataset. In this section, we will showcase some of the transformations available on RDDs.
Note
Due to space constraints, we include only the most commonly used transformations and actions here. For the full set of available methods, see PySpark's documentation on RDDs: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.
Since RDDs are schema-less, in this section we assume you know the schema of the produced dataset. If you cannot remember the positions of information in the parsed dataset we suggest you refer to the definition of the extractInformation(...) method on GitHub, code for Chapter 03.
The .map(...) transformation
It can be argued that you will use the .map(...) transformation most often. The method is applied to each element of the RDD: in the case of the data_from_file_conv dataset, you can think of this as a transformation...
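The element-wise behavior of .map(...) can be illustrated with a minimal sketch. The sample rows, the extract_year helper, and the years_with_spark wrapper are hypothetical names introduced only for illustration; the real schema of data_from_file_conv is different and is not reproduced here. The Spark part assumes a local PySpark installation, so it is wrapped in a function and imported lazily.

```python
# Hypothetical records standing in for rows of a parsed dataset:
# (identifier, year stored as text). This is NOT the schema of
# data_from_file_conv; it only illustrates the mechanics of .map(...).
SAMPLE_ROWS = [("a", "2014"), ("b", "2013"), ("c", "2014")]


def extract_year(row):
    """Convert the second field of a row to an integer."""
    return int(row[1])


def years_with_spark(rows):
    """Apply extract_year to every element of an RDD via .map(...).

    Assumes PySpark is installed; the import is done lazily so that
    extract_year can be used and tested without Spark.
    """
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(rows)
    # .map(...) is lazy: it only records the transformation.
    # .collect() is the action that triggers the computation.
    return rdd.map(extract_year).collect()


# Plain-Python equivalent of the same element-wise semantics (no Spark needed)
years_local = list(map(extract_year, SAMPLE_ROWS))
```

With a working Spark installation, calling years_with_spark(SAMPLE_ROWS) should return the same values as years_local, one output element per input row. Keeping the per-row logic in a named function rather than an inline lambda makes it easy to test without starting a Spark context.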