Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
The Kaggle Workbook

You're reading from   The Kaggle Workbook Self-learning exercises and valuable insights for Kaggle data science competitions

Arrow left icon
Product type Paperback
Published in Feb 2023
Publisher Packt
ISBN-13 9781804611210
Length 172 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Authors (2):
Arrow left icon
Luca Massaron Luca Massaron
Author Profile Icon Luca Massaron
Luca Massaron
Konrad Banachewicz Konrad Banachewicz
Author Profile Icon Konrad Banachewicz
Konrad Banachewicz
Arrow right icon
View More author details
Toc

Ensembling the results

Now, having two models, what’s left is to mix them together and see if we can improve the results. As suggested by Jahrer we go straight for a blend of them, but we do not limit ourselves to producing just an average of the two (since our approach in the end has slightly differed from Jahrer’s one) and we will also try to get optimal weights for the blend. We start importing the out-of-fold predictions and get our evaluation function ready:

import pandas as pd
import numpy as np
from numba import jit
@jit
def eval_gini(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_pred)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n-1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini
lgb_oof = pd.read_csv("../input/workbook-lgb/lgb_oof.csv")
dnn_oof = pd.read_csv("../input/workbook-dae/dnn_oof.csv")
target = pd.read_csv("../input/porto-seguro-safe-driver-prediction/train.csv", usecols=['id','target'])

Once done, we convert the out-of-fold predictions of the LightGBM and the predictions of the neural network into ranks. We are doing so because the normalized Gini coefficient is based on rankings (as a ROC-AUC evaluation would be) and consequently blending rankings works better than blending the predicted probabilities::

lgb_oof_ranks = (lgb_oof.target.rank() / len(lgb_oof))
dnn_oof_ranks = (dnn_oof.target.rank() / len(dnn_oof))

Now we just test if, by combining the two models using different weights, we can get a better evaluation of the out-of-fold data:

baseline = eval_gini(y_true=target.target, y_pred=lgb_oof_ranks)
print(f"starting from a oof lgb baseline {baseline:0.5f}\n")
best_alpha = 1.0
for alpha in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    ensemble = alpha * lgb_oof_ranks + (1.0 - alpha) * dnn_oof_ranks
    score = eval_gini(y_true=target.target, y_pred=ensemble)
    print(f"lgd={alpha:0.1f} dnn={(1.0 - alpha):0.1f} -> {score:0.5f}")
    
    if score > baseline:
        baseline = score
        best_alpha = alpha
        
print(f"\nBest alpha is {best_alpha:0.1f}")

When ready, by running the snippet we can get interesting results:

starting from a oof lgb baseline 0.28850
lgd=0.1 dnn=0.9 -> 0.27352
lgd=0.2 dnn=0.8 -> 0.27744
lgd=0.3 dnn=0.7 -> 0.28084
lgd=0.4 dnn=0.6 -> 0.28368
lgd=0.5 dnn=0.5 -> 0.28595
lgd=0.6 dnn=0.4 -> 0.28763
lgd=0.7 dnn=0.3 -> 0.28873
lgd=0.8 dnn=0.2 -> 0.28923
lgd=0.9 dnn=0.1 -> 0.28916
Best alpha is 0.8

It seems that blending a strong weight (0.8) on the LightGBM model and a weaker one (0.2) on the neural network will bring an even better-performing model. We immediately try this hypothesis by setting a blend of the same weights for the models and the ideal weights that we have found:

lgb_submission = pd.read_csv("../input/workbook-lgb/lgb_submission.csv")
dnn_submission = pd.read_csv("../input/workbook-dae/dnn_submission.csv")
submission = pd.read_csv(
"../input/porto-seguro-safe-driver-prediction/sample_submission.csv")

First, we try the equal weights solution, which was the strategy used by Michael Jahrer:

lgb_ranks = (lgb_submission.target.rank() / len(lgb_submission))
dnn_ranks = (dnn_submission.target.rank() / len(dnn_submission))
submission.target = lgb_ranks * 0.5 + dnn_ranks * 0.5
submission.to_csv("equal_blend_rank.csv", index=False)

It leads to a public score of 0.28393 and a private score of 0.29093, which is around 50th position on the final leaderboard, a bit far from our expectations. Now let’s try using the weights that the out-of-fold predictions helped us to find:

lgb_ranks = (lgb_submission.target.rank() / len(lgb_submission))
dnn_ranks = (dnn_submission.target.rank() / len(dnn_submission))
submission.target = lgb_ranks * best_alpha +  dnn_ranks * (1.0 - best_alpha)
submission.to_csv("blend_rank.csv", index=False)

Here the results lead to a public score of 0.28502 and a private score of 0.29192, which turns out to be around the seventh position on the final leaderboard. A much better result indeed because the LightGBM is a good model, but it is probably missing some nuances in the data that can be provided by adding some information from the neural network trained on the denoised data.

Exercise 6

As pointed out by CPMP in their solution (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44614), depending on how to build your cross-validation, you can experience a “huge variation of Gini scores among folds.” For this reason, CPMP suggests decreasing the variance of the estimates by using many different seeds for multiple cross-validations and averaging the results.

As an exercise, try to modify the code we used to create more stable predictions, especially for the denoising autoencoder.

Exercise Notes (write down any notes or workings that will help you):

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime