Getting to know your data
To build a statistical model in an informed way, you need an intimate knowledge of the dataset. It is possible to build a successful model without knowing the data, but it is a much more arduous task, or it requires more technical resources to test all the possible combinations of features. Therefore, after spending the required 80% of the time cleaning the data, we spend the next 15% getting to know it!
Descriptive statistics
I normally start with descriptive statistics. Even though DataFrames expose the .describe() method, since we are working with MLlib, we will use the .colStats(...) method.
Note
A word of warning: the .colStats(...) method calculates the descriptive statistics based on a sample. For real-world datasets this should not really matter, but if your dataset has fewer than 100 observations you might get some strange results.
The method takes an RDD of data for which to calculate the descriptive statistics and returns a MultivariateStatisticalSummary...