Generating frequency tables
In this recipe, we will see how to analyze the distribution of the variables in a dataset. Typically, we would plot a histogram or boxplot of each variable to understand its distribution and identify outliers. However, Spark currently has no support for plotting data, so let's see how to perform this analysis by generating frequency tables instead.
Getting ready
To step through this recipe, you need Ubuntu 14.04 (a Linux flavor) installed on the machine, along with Apache Hadoop 2.6 and Apache Spark 1.6.0.
How to do it…
Let's take an example of loan prediction data. Here is what the sample data looks like:
Note
Download the data from the following location: https://github.com/ChitturiPadma/datasets/blob/master/Loan_Prediction_Data.csv.
The total record count is 614.
- Let us look at the chances of getting a loan based on Credit_History. Here is the code to generate the frequency distribution of a set of variables such as Loan_Status and Credit_History...
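As a conceptual aside, a two-way frequency table simply counts how often each pair of values occurs. The following is a minimal sketch in plain Python, independent of Spark, of what such a table of Credit_History against Loan_Status contains. The rows are made up for illustration and are not taken from the actual loan dataset, and `frequency_table` is a hypothetical helper name rather than part of any library.

```python
from collections import Counter

# Hypothetical (Credit_History, Loan_Status) pairs; NOT real rows from the dataset.
rows = [
    ("1", "Y"), ("1", "Y"), ("1", "N"),
    ("0", "N"), ("0", "N"), ("0", "Y"),
    ("1", "Y"),
]


def frequency_table(pairs):
    """Count how often each (row_value, column_value) pair occurs."""
    counts = Counter(pairs)
    row_keys = sorted({r for r, _ in pairs})
    col_keys = sorted({c for _, c in pairs})
    # Counter returns 0 for absent keys, so missing combinations show up as 0.
    return {r: {c: counts[(r, c)] for c in col_keys} for r in row_keys}


table = frequency_table(rows)
for credit_history, by_status in table.items():
    print(credit_history, by_status)
```

Each outer key is a Credit_History value and each inner dictionary holds the Loan_Status counts for it, which is the same shape of result a cross-tabulation of the two columns would give.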