Preparing data in DataFrames
Beyond the filtering, conversions, and transformations on DataFrames that we saw in Chapter 2, Getting Started with Apache Spark DataFrames, let's look at a few more data preparation tricks in this recipe. We'll also look at more specific data preparation in Chapter 5, Learning from Data, where we focus on using various machine learning algorithms.
How to do it...
While preprocessing data, we may be required to:
Merge two different datasets
Perform set operations on two datasets
Sort the DataFrame on an attribute after casting its value to another type
Choose a member from one dataset over another based on a predicate
Parse arbitrary date/time inputs
We'll use the StudentPrep1.csv and StudentPrep2.csv datasets for the first four tasks, and for the last one, we'll use StrangeDate.json, a JSON-based dataset. The CSV and JSON formats were chosen primarily for convenience; the input data could come in any format.
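Before walking through the recipe, the first four tasks can be sketched against small in-memory DataFrames. This is a minimal sketch, not the recipe's own code: it assumes Spark 2.0 or later (`SparkSession`, `union`), and the `DataPrepSketch` object, the column names (`id`, `name`, `email`), and the rows are hypothetical stand-ins for the two CSV datasets, whose actual schema may differ. Date/time parsing is left out here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object DataPrepSketch {
  // Local session, meant for experimentation only.
  val spark: SparkSession = SparkSession.builder()
    .appName("DataPrepSketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical stand-ins for StudentPrep1.csv and StudentPrep2.csv.
  // The id is kept as a string on purpose, to show why a cast matters when sorting.
  val students1: DataFrame = Seq[(String, String, Option[String])](
    ("1", "Burke", None),
    ("10", "Maria", Some("maria@example.com")),
    ("2", "Hale", Some("hale@example.com"))
  ).toDF("id", "name", "email")

  val students2: DataFrame = Seq[(String, String, Option[String])](
    ("1", "Burke", Some("burke@example.com")),
    ("3", "Odin", Some("odin@example.com"))
  ).toDF("id", "name", "email")

  // 1. Merge: append the rows of one dataset to the other.
  //    union matches columns by position, so both schemas must line up.
  val merged: DataFrame = students1.union(students2)

  // 2. Set operations: ids present in both datasets, and ids only in the first.
  val commonIds: DataFrame = students1.select("id").intersect(students2.select("id"))
  val onlyInFirst: DataFrame = students1.select("id").except(students2.select("id"))

  // 3. Sort by casting: order numerically rather than lexicographically.
  val sortedById: DataFrame = students1.orderBy(col("id").cast("int"))

  // 4. Choose one member over another based on a predicate:
  //    keep the first dataset's email unless it is null, else take the second's.
  val preferred: DataFrame = students1.as("a")
    .join(students2.as("b"), col("a.id") === col("b.id"))
    .select(
      col("a.id"),
      when(col("a.email").isNotNull, col("a.email"))
        .otherwise(col("b.email"))
        .as("email")
    )

  def main(args: Array[String]): Unit = {
    merged.show()
    commonIds.show()
    onlyInFirst.show()
    sortedById.show()
    preferred.show()
    spark.stop()
  }
}
```

On older Spark 1.x releases, the same ideas apply, but the entry point is `SQLContext` and the merge method is `unionAll` rather than `union`.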
The StudentPrep1.csv dataset is shown in this screenshot:
The StudentPrep2.csv dataset...