Best practices for statistics
Statistics are an integral part of any predictive modelling assignment because they help us gauge how well a model performs. Each predictive model generates a set of statistics that indicate how good the model is and how it can be fine-tuned to perform better. The following table summarizes the most widely reported statistics and their desired values for the predictive models described in this book:
Algorithm | Statistics/parameters | Desired values
---|---|---
Linear regression | R², p-values, F-statistic, and Adj. R² | High Adj. R², high F-statistic, and low p-values
Logistic regression | Sensitivity, specificity, Area Under the Curve (AUC), and KS statistic | High AUC (proximity to 1)
Clustering | Intra-cluster distance and silhouette coefficient | Low intra-cluster distance and high silhouette coefficient (proximity to 1)
Decision trees (classification) | AUC and KS statistic | High AUC (proximity to 1)
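To make a few of these statistics concrete, the following is a minimal sketch in plain Python of how R², adjusted R², AUC, and the KS statistic can be computed from a model's outputs. The function names and the toy data are illustrative assumptions, not taken from the book; in practice a library such as scikit-learn or statsmodels would normally report these values for you.

```python
def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """R^2 penalized for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def split_scores(labels, scores):
    """Separate the scores of positive (1) and negative (0) cases."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    return pos, neg


def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the rank-based definition of AUC).
    """
    pos, neg = split_scores(labels, scores)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(labels, scores):
    """Largest gap between the cumulative score distributions of the two classes."""
    pos, neg = split_scores(labels, scores)
    gaps = []
    for t in sorted(set(scores)):
        cdf_pos = sum(p <= t for p in pos) / len(pos)
        cdf_neg = sum(n <= t for n in neg) / len(neg)
        gaps.append(abs(cdf_pos - cdf_neg))
    return max(gaps)


# Toy data (hypothetical), only to exercise the functions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]
r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_obs=len(y_true), n_predictors=1)

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
model_auc = auc(labels, scores)
model_ks = ks_statistic(labels, scores)

print(r2, adj_r2, model_auc, model_ks)
```

Adjusted R² is never larger than R² when the model has at least one predictor, which is why it is the preferred figure when comparing models with different numbers of variables. The pairwise AUC computation above is quadratic in the number of observations, so it is suitable only for small illustrative datasets.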
While reporting...