Summary and descriptive statistics
In this recipe, we will see how to compute summary statistics for data at scale in Spark. Descriptive summary statistics help in understanding the distribution of the data.
Getting ready
To step through this recipe, you need Ubuntu 14.04 (or another Linux flavor) installed on the machine, along with Apache Hadoop 2.6 and Apache Spark 1.6.0.
How to do it…
Let's take an example of loan prediction data. Here is what the sample data looks like:
Note
Download the data from the following location: https://github.com/ChitturiPadma/datasets/blob/master/Loan_Prediction_Data.csv.
- The preceding data contains both numerical and categorical fields. We can get the summary of the numerical fields as follows:
import org.apache.spark._
import org.apache.spark.sql._

object Summary_Statistics {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      ...
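To make it clear what the summary contains, here is a minimal plain-Scala sketch of the statistics that Spark's `DataFrame.describe()` reports for a numeric column: count, mean, standard deviation, min, and max. The sample values below are illustrative only and are not taken from the loan prediction dataset:

```scala
object SummaryStatsSketch {

  // Mean of a numeric column.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  // Sample standard deviation (divides by n - 1), which is what
  // DataFrame.describe() reports for numeric columns.
  def stddev(xs: Seq[Double]): Double = {
    val m = mean(xs)
    math.sqrt(xs.map(v => (v - m) * (v - m)).sum / (xs.length - 1))
  }

  def main(args: Array[String]): Unit = {
    // Stand-in values for one numeric column (e.g. a loan amount).
    val values: Seq[Double] = Seq(128, 66, 120, 141, 267, 95, 158)

    println(f"count:  ${values.length}")
    println(f"mean:   ${mean(values)}%.2f")
    println(f"stddev: ${stddev(values)}%.2f")
    println(f"min:    ${values.min}%.2f")
    println(f"max:    ${values.max}%.2f")
  }
}
```

In the Spark program itself, the equivalent one-liner is `dataFrame.describe().show()`, which computes these same statistics in a distributed fashion over every numeric column at once.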