Often, some of the analysis you wish to perform will not be available within SparkR, and you will need to extract some of the data from Spark objects and return it to base R.
For example, we were able to run correlation and covariance functions directly on a Spark dataframe earlier by specifying individual pairs of variables. However, we did not generate correlation matrices for the entire dataframe, for a couple of reasons:
- The capability to do this may not be built into the version of Spark that you are currently running
- Even if it were available, these kinds of calculations could be very computationally expensive to perform
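As a rough illustration of pulling data back into base R, the sketch below samples a fraction of a Spark dataframe, collects it into a local R data.frame, and computes a full correlation matrix with base R's cor(). The helper names cor_numeric and spark_sample_cor, the 10% sampling fraction, and the seed are assumptions made for this example, not part of SparkR; only sample() and collect() are SparkR functions.

```r
# Base R helper (hypothetical name): correlation matrix over the numeric
# columns of an ordinary, local R data.frame.
cor_numeric <- function(local_df) {
  numeric_cols <- vapply(local_df, is.numeric, logical(1))
  cor(local_df[, numeric_cols, drop = FALSE], use = "complete.obs")
}

# Hypothetical wrapper: sample a Spark dataframe, bring the sample back
# to base R, and compute the matrix locally. The fraction and seed are
# illustrative defaults and should be tuned to the size of your data.
spark_sample_cor <- function(sdf, fraction = 0.1, seed = 42) {
  sampled <- SparkR::sample(sdf, withReplacement = FALSE,
                            fraction = fraction, seed = seed)
  local_df <- SparkR::collect(sampled)  # now a base R data.frame
  cor_numeric(local_df)
}
```

With an active Spark session and a Spark dataframe (called sdf here purely for illustration), spark_sample_cor(sdf) would return an ordinary R matrix. Because the matrix is computed on a sample, it is an estimate of the correlations in the full dataframe.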
One strategy is to use Spark functions to explore basic characteristics of the data, and/or to use specialized packages written for Spark (such as MLlib) to perform this...