The Kaggle Workbook: Self-learning exercises and valuable insights for Kaggle data science competitions

Product type Paperback
Published in Feb 2023
Publisher Packt
ISBN-13 9781804611210
Length 172 pages
Edition 1st Edition
Authors (2): Luca Massaron, Konrad Banachewicz

Understanding the evaluation metric

The metric used in the competition is the normalized Gini coefficient (named after the similar Gini coefficient/index used in economics), which has been previously used in another competition, the Allstate Claim Prediction Challenge (https://www.kaggle.com/competitions/ClaimPredictionChallenge). From that competition, we can get a very clear explanation of what this metric is about:

When you submit an entry, the observations are sorted from “largest prediction” to “smallest prediction.” This is the only step where your predictions come into play, so only the order determined by your predictions matters. Visualize the observations arranged from left to right, with the largest predictions on the left. We then move from left to right, asking “In the leftmost x% of the data, how much of the actual observed loss have you accumulated?” With no model, you can expect to accumulate 10% of the loss in 10% of the predictions, so no model (or a “null” model) achieves a straight line. We call the area between your curve and this straight line the Gini coefficient.

There is a maximum achievable area for a “perfect” model. We will use the normalized Gini coefficient by dividing the Gini coefficient of your model by the Gini coefficient of the perfect model.

The organizers of the competition did not propose any formulation for the normalized Gini coefficient apart from this verbose description. However, by reading the notebook from Mohsin Hasan (https://www.kaggle.com/code/tezdhar/faster-gini-calculation/notebook), we can figure out that it is calculated in two steps, and we can obtain some easy-to-understand pseudocode that reveals its inner workings. First, you get the Gini coefficient for your predictions; then, you normalize it by dividing it by another Gini coefficient, computed by pretending you have perfect predictions. Here is the pseudocode for the Gini coefficient:

order = indexes that sort the predictions (expressed as probabilities) from lowest to highest

sorted_actual = actual[order] = ground truth values rearranged according to the sorted predictions

cumsum_sorted_actual = cumulative sum of the sorted ground truth values

n = number of predictions

gini_coef = (sum(cumsum_sorted_actual) / sum(sorted_actual) - (n + 1) / 2) / n

Once you have the Gini coefficient for your predictions, you need to divide it by the Gini coefficient you compute using the ground truth values as if they were your predictions (the case of having perfect predictions):

norm_gini_coef = gini_coef(predictions) / gini_coef(ground truth)
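The pseudocode above translates almost line by line into NumPy. What follows is a minimal sketch, not the implementation from the cited notebook: the function names simply mirror the pseudocode, and the toy arrays are made up for illustration. Note that sorting from lowest to highest makes the raw coefficient negative for a good model; the sign cancels out in the normalization.

```python
import numpy as np

def gini_coef(actual, predictions):
    """Gini coefficient as in the pseudocode (predictions sorted in ascending order)."""
    actual = np.asarray(actual, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    order = np.argsort(predictions, kind="stable")   # indexes of sorted predictions
    sorted_actual = actual[order]                    # ground truth in prediction order
    cumsum_sorted_actual = np.cumsum(sorted_actual)  # accumulated ground truth
    n = len(actual)
    return (cumsum_sorted_actual.sum() / sorted_actual.sum() - (n + 1) / 2) / n

def norm_gini_coef(actual, predictions):
    """Normalize by the Gini coefficient of perfect predictions (the ground truth itself)."""
    return gini_coef(actual, predictions) / gini_coef(actual, actual)

# Toy data (illustrative only)
actual = np.array([0, 0, 1, 0, 1])
perfect = np.array([0.1, 0.2, 0.9, 0.3, 0.8])  # ranks every positive above every negative
inverted = 1 - perfect                         # ranks every positive below every negative

print(norm_gini_coef(actual, perfect))
print(norm_gini_coef(actual, inverted))
```

In this toy case, predictions that rank all positives first reach the upper bound of the metric, while the inverted ranking reaches the lower bound.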

Another good explanation is provided in the notebook by Kilian Batzner: https://www.kaggle.com/code/batzner/gini-coefficient-an-intuitive-explanation. Using clear plots and some toy examples, Kilian tries to make sense of a metric that is not so common, yet is routinely used by the actuarial departments of insurance companies.

The metric can be approximated by the ROC-AUC score or, equivalently, by the statistic of the Mann–Whitney U non-parametric test (the U statistic is equivalent to the area under the receiver operating characteristic curve, AUC), because it approximately corresponds to 2 * ROC-AUC - 1. Hence, maximizing the ROC-AUC is the same as maximizing the normalized Gini coefficient (for a reference, see the Relation to other statistical measures section in the Wikipedia entry: https://en.wikipedia.org/wiki/Gini_coefficient).
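To make the relation with ROC-AUC tangible, the following sketch compares the two quantities on synthetic data. It assumes a binary target and predictions without ties; the helper names (`norm_gini_coef`, `roc_auc_from_ranks`) and the simulated scores are ours, chosen only for illustration. The ROC-AUC is computed from the Mann–Whitney U statistic, using the ranks of the positive cases.

```python
import numpy as np

def gini_coef(actual, predictions):
    # Gini coefficient with predictions sorted from lowest to highest
    sorted_actual = np.asarray(actual, dtype=float)[np.argsort(predictions, kind="stable")]
    n = len(sorted_actual)
    return (np.cumsum(sorted_actual).sum() / sorted_actual.sum() - (n + 1) / 2) / n

def norm_gini_coef(actual, predictions):
    return gini_coef(actual, predictions) / gini_coef(actual, actual)

def roc_auc_from_ranks(actual, predictions):
    """ROC-AUC via the Mann-Whitney U statistic (assumes no tied predictions)."""
    actual = np.asarray(actual)
    n = len(actual)
    ranks = np.empty(n, dtype=float)
    ranks[np.argsort(predictions, kind="stable")] = np.arange(1, n + 1)
    n_pos = actual.sum()
    n_neg = n - n_pos
    u_stat = ranks[actual == 1].sum() - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)

# Synthetic binary target and noisy continuous scores (illustrative only)
rng = np.random.default_rng(0)
actual = rng.integers(0, 2, size=1000)
predictions = rng.random(1000) * 0.5 + actual * 0.2

auc = roc_auc_from_ranks(actual, predictions)
print(norm_gini_coef(actual, predictions), 2 * auc - 1)
```

Under these assumptions the two printed values agree up to floating-point error; with tied predictions, the sort-based Gini depends on how the ties are ordered, and the match becomes approximate.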

The metric can also be approximately expressed as the covariance of scaled prediction rank and scaled target value, resulting in a more understandable rank association measure (see Dmitriy Guller: https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/40576).
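One way to see the link with covariance is sketched below. This is our own illustration and not necessarily the formulation used in the linked discussion: assuming predictions without ties, the normalized Gini coefficient can be written as the covariance between prediction rank and target, divided by the same covariance under a perfect ranking. Since it is a ratio, any scaling of ranks and targets cancels out. The helper names are hypothetical.

```python
import numpy as np

def ranks_of(values):
    """Ranks from 1 to n in ascending order (ties broken by position)."""
    n = len(values)
    ranks = np.empty(n, dtype=float)
    ranks[np.argsort(values, kind="stable")] = np.arange(1, n + 1)
    return ranks

def norm_gini_from_cov(actual, predictions):
    """Ratio of rank/target covariances: model ranking vs. perfect ranking."""
    actual = np.asarray(actual, dtype=float)
    cov_model = np.cov(ranks_of(predictions), actual, bias=True)[0, 1]
    cov_perfect = np.cov(ranks_of(actual), actual, bias=True)[0, 1]
    return cov_model / cov_perfect

def norm_gini_coef(actual, predictions):
    """Reference computation following the sort-and-cumulate definition."""
    def gini(a, p):
        s = np.asarray(a, dtype=float)[np.argsort(p, kind="stable")]
        n = len(s)
        return (np.cumsum(s).sum() / s.sum() - (n + 1) / 2) / n
    return gini(actual, predictions) / gini(actual, actual)

# Synthetic data (illustrative only)
rng = np.random.default_rng(1)
actual = rng.integers(0, 2, size=500)
predictions = rng.random(500) + 0.3 * actual

print(norm_gini_from_cov(actual, predictions), norm_gini_coef(actual, predictions))
```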

From the point of view of the objective function, you can optimize for the binary log-loss (as you would do in a classification problem). Neither ROC-AUC nor the normalized Gini coefficient is directly differentiable, so they can be used only as evaluation metrics on the validation set (for instance, for early stopping or for reducing the learning rate in a neural network). Bear in mind, however, that optimizing for the log-loss does not always improve the ROC-AUC and the normalized Gini coefficient.
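A small experiment shows why the log-loss and the rank-based metrics can move independently. In this sketch (synthetic data, hypothetical helper names), we apply a strictly increasing transformation to the predicted probabilities: the ordering, and therefore the normalized Gini coefficient, is unchanged, while the log-loss changes because it depends on the calibration of the probabilities.

```python
import numpy as np

def log_loss(actual, proba):
    """Binary log-loss, with clipping to avoid log(0)."""
    proba = np.clip(proba, 1e-15, 1 - 1e-15)
    return -np.mean(actual * np.log(proba) + (1 - actual) * np.log(1 - proba))

def norm_gini_coef(actual, predictions):
    def gini(a, p):
        s = np.asarray(a, dtype=float)[np.argsort(p, kind="stable")]
        n = len(s)
        return (np.cumsum(s).sum() / s.sum() - (n + 1) / 2) / n
    return gini(actual, predictions) / gini(actual, actual)

# Synthetic target and probabilities in [0.1, 0.9) (illustrative only)
rng = np.random.default_rng(42)
actual = rng.integers(0, 2, size=500)
proba = 0.1 + 0.6 * rng.random(500) + 0.2 * actual

# Squaring is strictly increasing on positive values: same ranking, different calibration
proba_squared = proba ** 2

print("log-loss:", log_loss(actual, proba), log_loss(actual, proba_squared))
print("norm Gini:", norm_gini_coef(actual, proba), norm_gini_coef(actual, proba_squared))
```

This is why the rank-based metric is a sensible choice for monitoring on the validation set even when the model is trained on log-loss.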

There is actually a differentiable approximation of ROC-AUC. You can read about how it works in Toon Calders and Szymon Jaroszewicz, Efficient AUC Optimization for Classification, European Conference on Principles of Data Mining and Knowledge Discovery, Springer, Berlin, Heidelberg, 2007: https://link.springer.com/content/pdf/10.1007/978-3-540-74976-9_8.pdf.
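To give an idea of what a differentiable surrogate can look like, here is a sketch of a different and simpler technique than the one in the paper: a sigmoid relaxation of the pairwise comparison behind ROC-AUC. It is not the polynomial approximation proposed by Calders and Jaroszewicz; the function names and the `sharpness` parameter are our own assumptions.

```python
import numpy as np

def exact_auc(actual, scores):
    """ROC-AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    actual = np.asarray(actual)
    scores = np.asarray(scores, dtype=float)
    diff = scores[actual == 1][:, None] - scores[actual == 0][None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

def soft_auc(actual, scores, sharpness=10.0):
    """Smooth surrogate: the step function on each pair is replaced by a sigmoid."""
    actual = np.asarray(actual)
    scores = np.asarray(scores, dtype=float)
    diff = scores[actual == 1][:, None] - scores[actual == 0][None, :]
    return np.mean(1.0 / (1.0 + np.exp(-sharpness * diff)))

# Toy data (illustrative only)
actual = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

print(exact_auc(actual, scores))
for sharpness in (1.0, 10.0, 1000.0):
    print(sharpness, soft_auc(actual, scores, sharpness))
```

The larger the sharpness, the closer the surrogate gets to the exact value, at the price of flatter gradients; the pairwise matrix also grows with the number of positives times the number of negatives, which is one reason why more efficient approximations are of interest.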

In this competition, however, there seems to be no need to use anything other than log-loss as the objective function and ROC-AUC or the normalized Gini coefficient as the evaluation metric.

There are actually a few Python implementations for computing the normalized Gini coefficient among the Kaggle Notebooks. Here we have used, and we suggest, the work by CPMP (https://www.kaggle.com/code/cpmpml/extremely-fast-gini-computation/notebook), which uses Numba to speed up computations: it is both exact and fast.
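The following is a sketch in the spirit of that approach, not a copy of the notebook's code: a single pass over the targets, sorted by prediction, that counts the negatives ranked above each positive. It assumes a binary target with both classes present and no tied predictions; the function names are ours. Numba is used if it is installed; otherwise, the same function runs as plain Python.

```python
import numpy as np

try:
    from numba import njit  # optional: compiles the loop to machine code
except ImportError:
    def njit(func):
        # Fallback: leave the function as plain Python
        return func

@njit
def _gini_from_sorted(y_sorted):
    """Normalized Gini from binary targets sorted by ascending prediction."""
    n = y_sorted.shape[0]
    n_pos = 0.0          # positives seen so far
    n_neg_above = 0.0    # negatives with a higher prediction than the current row
    misordered = 0.0     # (positive, negative) pairs ranked the wrong way
    for i in range(n - 1, -1, -1):  # from highest to lowest prediction
        y_i = y_sorted[i]
        n_pos += y_i
        misordered += y_i * n_neg_above
        n_neg_above += 1.0 - y_i
    return 1.0 - 2.0 * misordered / (n_pos * (n - n_pos))

def fast_norm_gini(y_true, y_prob):
    y_true = np.asarray(y_true, dtype=np.float64)
    order = np.argsort(np.asarray(y_prob, dtype=np.float64))
    return _gini_from_sorted(y_true[order])

# Toy data (illustrative only)
print(fast_norm_gini([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Apart from the sort, the computation is one loop without intermediate arrays, which is the kind of code that Numba tends to accelerate well.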

Exercise 2

In Chapter 5 of The Kaggle Book (page 95 onward), we explained how to deal with competition metrics, especially if they are new and generally unknown.

As an exercise, can you find out how many competitions on Kaggle have used the normalized Gini coefficient as an evaluation metric?

Exercise Notes (write down any notes or workings that will help you):
