Sampling data with Spark SQL APIs
Often, we need to visualize data points to understand the nature of our data. Statisticians use sampling techniques extensively for data analysis. Spark supports both approximate and exact sample generation. Approximate sampling is faster and is good enough in most cases.
In this section, we will explore Spark SQL APIs used for generating samples. We will work through some examples of generating approximate and exact stratified samples, with and without replacement, using the DataFrame/Dataset API and RDD-based methods.
Sampling with the DataFrame/Dataset API
We can use the sampleBy method to create a stratified sample without replacement. For each value of the stratification column, we specify the fraction of rows with that value to be selected in the sample.
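As a minimal sketch of this call, the following Scala program builds a small DataFrame and draws a stratified sample from it. The data, the column names (id, category), the fractions, the seed, and the object name SampleByExample are all assumptions made for illustration; they are not taken from the dataset used in this section. Only df.stat.sampleBy itself is the Spark API under discussion.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SampleByExample {
  // Hypothetical input: 900 rows spread evenly over three categories.
  def buildData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    (1 to 900).map(i => (i, Seq("A", "B", "C")(i % 3))).toDF("id", "category")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SampleByExample")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    val df = buildData(spark)

    // Fraction of rows to keep for each value of the stratification column.
    // A value that is absent from the map ("C" here) is treated as fraction 0.
    val fractions = Map("A" -> 0.1, "B" -> 0.5)

    // Stratified sample without replacement; the seed makes the run repeatable.
    val sampled = df.stat.sampleBy("category", fractions, 42L)

    // Size of the sample and number of records of each type.
    println(s"Sample size: ${sampled.count()}")
    sampled.groupBy("category").count().orderBy("category").show()

    spark.stop()
  }
}
```

Because each row is kept or dropped independently, the per-category counts will be close to, but not necessarily equal to, the requested fractions of each stratum.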
The size of the sample and the number of records of each type are shown here:
Next, we create a sample with replacement that selects a fraction of rows (10% of the total records) using a random seed. Using sample is not guaranteed to provide the exact fraction of the total...
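A sample of this kind can be sketched as follows. The input DataFrame (1,000 generated ids), the seed value, and the object name SampleWithReplacementExample are assumptions for illustration only; the call being demonstrated is the sample method with withReplacement set to true and a fraction of 0.1.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SampleWithReplacementExample {
  // Hypothetical input: a single-column DataFrame with ids 0 to 999.
  def buildData(spark: SparkSession): DataFrame =
    spark.range(0, 1000).toDF("id")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SampleWithReplacementExample")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    val df = buildData(spark)

    // Sample with replacement: fraction is the expected number of times
    // each row is chosen, so the sample size is only approximately 10%.
    val sampled = df.sample(withReplacement = true, fraction = 0.1, seed = 42L)

    println(s"Total rows: ${df.count()}, sampled rows: ${sampled.count()}")

    // With replacement, the same row may appear more than once in the sample.
    sampled.groupBy("id").count().filter("count > 1").show(5)

    spark.stop()
  }
}
```

Running this with different seeds should give sample sizes that vary around 100 rows rather than landing on exactly 100, which is the behavior the text describes.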