Understanding Various Spark Transformations
Spark supports two kinds of operations: transformations and actions. A transformation takes an input dataset, processes it, and produces a new output dataset. An action executes a computation over a dataset and returns a value to the driver program. The MapReduce construct illustrates this split: map is a transformation, while reduce is an action. Transformations are lazy; they are evaluated only when the driver triggers an action.
Spark has several core functions. The ETL side of a Spark pipeline's transformations was covered in detail in Chapter 3, Data Preparation; this section focuses on transformations used for machine learning modeling. A typical AI workflow begins by identifying the raw data, which may arrive in various formats, including SQL tables, CSV, and JSON.
Some of the most popular transformations are map, filter, sample, union, intersection, and distinct. Applying any of these to a dataset produces a new dataset; the original is never modified in place.
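The effect of each of these transformations can be seen with plain-Python analogues on small lists (Spark applies the same logic in parallel across a cluster and returns a new RDD each time; the variable names here are illustrative):

```python
import random

a = [1, 2, 2, 3, 4]
b = [3, 4, 5]

mapped = [x * 10 for x in a]             # map: apply a function to every element
filtered = [x for x in a if x % 2 == 0]  # filter: keep elements matching a predicate
sampled = random.sample(a, 2)            # sample: draw a random subset
unioned = a + b                          # union: combine datasets (duplicates kept, as in Spark)
intersected = sorted(set(a) & set(b))    # intersection: common elements, deduplicated
deduplicated = sorted(set(a))            # distinct: remove duplicate elements

print(mapped)        # [10, 20, 20, 30, 40]
print(filtered)      # [2, 2, 4]
print(unioned)       # [1, 2, 2, 3, 4, 3, 4, 5]
print(intersected)   # [3, 4]
print(deduplicated)  # [1, 2, 3, 4]
```

Note the asymmetry, which mirrors Spark's RDD semantics: union keeps duplicates, while intersection and distinct remove them.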