Using SparkR for computing summary statistics
The describe (or summary) operation creates a new DataFrame that contains the count, mean, standard deviation, min, and max values for a specified DataFrame or a list of numerical columns:
> sumstatsdf <- describe(df, "duration", "campaign", "previous", "age")
> showDF(sumstatsdf)
Computing all of these values at once on a large dataset can be computationally expensive. Hence, we compute each of these statistical measures individually here:
> avgagedf <- agg(df, mean = mean(df$age))
> showDF(avgagedf) # Print this DF
+-----------------+
|             mean|
+-----------------+
|40.02406040594348|
+-----------------+
Next, we create a DataFrame that lists the minimum and maximum values and the range width:
> agerangedf <- agg(df, minimum = min(df$age), maximum = max(df$age), range_width = abs(max(df$age) - min(df$age)))
> showDF(agerangedf)
Next, we compute the sample variance and standard deviation as shown here:
> agevardf <- agg...
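For reference, the sample variance and sample standard deviation use the n - 1 denominator rather than n. As a brief sketch, write x_1, ..., x_n for the non-null values of the age column and \bar{x} for their mean (these symbols are introduced here for illustration and do not appear in the original code):

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2,
\qquad
s = \sqrt{s^2}
```

Here s^2 is the sample variance and s is the sample standard deviation; both are undefined for n = 1, because the denominator is zero.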