Building a LightGBM submission
Our exercise starts by working out a solution based on LightGBM. You can find the code already set up for execution in Kaggle Notebooks at this address: https://www.kaggle.com/code/lucamassaron/workbook-lgb. Although we have made the code readily available, we suggest you instead type or copy the code directly from the book and execute it cell by cell; understanding what each line of code does and personalizing the solution can make it perform even better.
When using LightGBM, you don't have to (and should not) turn on any of the GPU or TPU accelerators. GPU acceleration can help only if you have installed the GPU version of LightGBM. You can find working hints on how to install such a GPU-accelerated version in Kaggle Notebooks using this example: https://www.kaggle.com/code/lucamassaron/gpu-accelerated-lightgbm.
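If you do install such a GPU-enabled build, switching LightGBM to the GPU is then just a matter of adding the device parameter to the training parameters, as in this minimal sketch (optional, and not needed for the CPU-only solution developed in this chapter):
# Optional: only meaningful with a LightGBM build compiled for GPU
gpu_params = {'objective': 'binary', 'device': 'gpu'}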
We start by importing the key packages (NumPy, pandas, Optuna for hyperparameter optimization, LightGBM, and some utility functions). We also define a configuration class and instantiate it. We will discuss the parameters defined in the configuration class as we progress through the code. What is important to remark here is that keeping all your parameters in a single class makes it easier to modify them consistently throughout the code. In the heat of competition, it is easy to forget to update a parameter that is referred to in multiple places, and it is always difficult to manage parameters that are dispersed among cells and functions. A configuration class can save you a lot of effort and spare you mistakes along the way:
import numpy as np
import pandas as pd
import optuna
import lightgbm as lgb
from path import Path
from sklearn.model_selection import StratifiedKFold
class Config:
    input_path = Path('../input/porto-seguro-safe-driver-prediction')
    optuna_lgb = False
    n_estimators = 1500
    early_stopping_round = 150
    cv_folds = 5
    random_state = 0
    params = {'objective': 'binary',
              'boosting_type': 'gbdt',
              'learning_rate': 0.01,
              'max_bin': 25,
              'num_leaves': 31,
              'min_child_samples': 1500,
              'colsample_bytree': 0.7,
              'subsample_freq': 1,
              'subsample': 0.7,
              'reg_alpha': 1.0,
              'reg_lambda': 1.0,
              'verbosity': 0,
              'random_state': 0}
config = Config()
The next step requires importing the training, test, and sample submission datasets. We do this using the pandas read_csv function. We also set the index of the uploaded DataFrames to the identifier (the id column) of each data example.
Since features that belong to similar groupings are tagged (using the ind, reg, car, and calc tags in their labels), and since binary and categorical features are also easy to locate (they use the bin and cat tags, respectively, in their labels), we can enumerate them and record them in lists:
train = pd.read_csv(config.input_path / 'train.csv', index_col='id')
test = pd.read_csv(config.input_path / 'test.csv', index_col='id')
submission = pd.read_csv(config.input_path / 'sample_submission.csv', index_col='id')
calc_features = [feat for feat in train.columns if "_calc" in feat]
cat_features = [feat for feat in train.columns if "_cat" in feat]
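The same pattern also enumerates the other groupings mentioned above; these extra lists are not needed by the rest of the solution, but they come in handy if you want to explore the groups separately:
# Not used later in the solution, just for exploring the feature groups
ind_features = [feat for feat in train.columns if "_ind" in feat]
reg_features = [feat for feat in train.columns if "_reg" in feat]
car_features = [feat for feat in train.columns if "_car" in feat]
bin_features = [feat for feat in train.columns if "_bin" in feat]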
Then, we just extract the target (a binary target of 0s and 1s) and remove it from the training dataset:
target = train["target"]
train = train.drop("target", axis="columns")
At this point, as pointed out by Michael Jahrer, we can drop the calc features. This idea recurred a lot during the competition (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41970), especially in notebooks, because it could be empirically verified that dropping them improved both the local cross-validation score and the public leaderboard score (as a general rule, it's important to keep track of both during feature selection). In addition, these features also performed poorly in gradient boosting models (their importance is always below the average).
We can argue that, since they are engineered features, they do not contain new information with respect to the original features they derive from; they just add noise to any model trained on them:
train = train.drop(calc_features, axis="columns")
test = test.drop(calc_features, axis="columns")
Exercise 3
Based on the suggestions provided in The Kaggle Book on page 220 (Using feature importance to evaluate your work), as an exercise:
- Code your own feature selection notebook for this competition (a possible starting point is sketched after the notes below).
- Check what features should be kept and what should be discarded.
Exercise Notes (write down any notes or workings that will help you):
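As a possible starting point for this exercise (our sketch, not part of the original solution), you could rank features by LightGBM importance on the training data; features scoring below the average importance are natural candidates for removal:
# A rough, hypothetical starting point: rank features by LightGBM importance
fs_model = lgb.LGBMClassifier(**config.params, n_estimators=300,
                              force_row_wise=True)
fs_model.fit(train, target)
importances = pd.Series(fs_model.feature_importances_, index=train.columns)
print(importances.sort_values(ascending=False).head(20))
print("Below-average importance:",
      list(importances[importances < importances.mean()].index))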
Categorical features are instead one-hot encoded. Because the same labels are present in the training and test datasets (the result of a careful train/test split arranged by the Porto Seguro team), instead of the usual scikit-learn OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) we are going to use the pandas get_dummies function (https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html). Since the pandas function may produce different encodings if the features and their levels differ between the train and test sets, we add an assert to check that the one-hot encoding results in the same columns for both:
train = pd.get_dummies(train, columns=cat_features)
test = pd.get_dummies(test, columns=cat_features)
assert((train.columns==test.columns).all())
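If you ever work with data where this assert fails (because some categorical level appears in only one of the two datasets), a simple remedy, sketched here as a hedged suggestion rather than as part of the original solution, is to align the two DataFrames on the union of their columns, filling the missing dummy columns with zeros:
# Hypothetical fallback, only needed when the dummy columns differ
train, test = train.align(test, join='outer', axis=1, fill_value=0)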
One-hot encoding the categorical features completes the data processing stage. We proceed to define our evaluation metric, the normalized Gini coefficient, as previously discussed. We will use the extremely fast Gini computation code proposed by CPMP, as mentioned before.
Since we are going to use a LightGBM model, we have to add a suitable wrapper (gini_lgb) to return to the GBM algorithm the evaluation on the training and validation datasets in a form it can work with (see https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html?highlight=higher_better#lightgbm.Booster.eval: each evaluation function should accept two parameters, preds and eval_data, and return a (eval_name, eval_result, is_higher_better) tuple or a list of such tuples):
from numba import jit
@jit
def eval_gini(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_pred)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n-1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini

def gini_lgb(y_true, y_pred):
    eval_name = 'normalized_gini_coef'
    eval_result = eval_gini(y_true, y_pred)
    is_higher_better = True
    return eval_name, eval_result, is_higher_better
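Because the normalized Gini coefficient is linearly related to the ROC-AUC (Gini = 2 * AUC - 1 when there are no tied predictions), you can sanity-check the Numba implementation against scikit-learn. This quick verification is our addition, not part of the original solution:
# Sanity check (our addition): normalized Gini should equal 2 * AUC - 1
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_check = rng.integers(0, 2, size=1_000)
p_check = rng.random(1_000)
assert np.isclose(eval_gini(y_check, p_check),
                  2 * roc_auc_score(y_check, p_check) - 1)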
As for the training parameters, we found that the parameters suggested by Michael Jahrer in his post (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44629) work perfectly.
You may also try to come up with the same parameters, or similarly performing ones, by running a search with Optuna (https://optuna.org/) if you set the optuna_lgb flag to True in the Config class. Here, the optimization tries to find the best values for key parameters, such as the learning rate and the regularization parameters, based on a five-fold cross-validation test on the training data. To speed things up, early stopping on the validation fold is used (which, we are aware, could favor parameters that better overfit that validation fold; a good alternative would be to remove the early stopping callback and train for a fixed number of rounds):
if config.optuna_lgb:

    def objective(trial):
        params = {
            'learning_rate': trial.suggest_float("learning_rate", 0.01, 1.0),
            'num_leaves': trial.suggest_int("num_leaves", 3, 255),
            'min_child_samples': trial.suggest_int("min_child_samples",
                                                   3, 3000),
            'colsample_bytree': trial.suggest_float("colsample_bytree",
                                                    0.1, 1.0),
            'subsample_freq': trial.suggest_int("subsample_freq", 0, 10),
            'subsample': trial.suggest_float("subsample", 0.1, 1.0),
            'reg_alpha': trial.suggest_float("reg_alpha", 1e-9, 10.0,
                                             log=True),
            'reg_lambda': trial.suggest_float("reg_lambda", 1e-9, 10.0,
                                              log=True),
        }

        score = list()
        skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True,
                              random_state=config.random_state)

        for train_idx, valid_idx in skf.split(train, target):
            X_train = train.iloc[train_idx]
            y_train = target.iloc[train_idx]
            X_valid = train.iloc[valid_idx]
            y_valid = target.iloc[valid_idx]

            model = lgb.LGBMClassifier(**params,
                                       n_estimators=1500,
                                       early_stopping_round=150,
                                       force_row_wise=True)
            callbacks = [lgb.early_stopping(stopping_rounds=150,
                                            verbose=False)]
            model.fit(X_train, y_train,
                      eval_set=[(X_valid, y_valid)],
                      eval_metric=gini_lgb, callbacks=callbacks)

            score.append(
                model.best_score_['valid_0']['normalized_gini_coef'])

        return np.mean(score)

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=300)

    print("Best Gini Normalized Score", study.best_value)
    print("Best parameters", study.best_params)

    params = {'objective': 'binary',
              'boosting_type': 'gbdt',
              'verbosity': 0,
              'random_state': 0}
    params.update(study.best_params)

else:
    params = config.params
During the competition, Tilii tested feature elimination using Boruta (https://github.com/scikit-learn-contrib/boruta_py). You can find his kernel here: https://www.kaggle.com/code/tilii7/boruta-feature-elimination/notebook. As you can check there, no calc feature is considered a confirmed feature by Boruta.
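If you want to replicate that kind of check yourself, the following is a minimal sketch of how BorutaPy is typically run (it assumes the boruta package is installed; the estimator and number of iterations are illustrative choices of ours, not Tilii's exact settings):
# Hedged sketch of a Boruta run; not Tilii's exact configuration
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=-1, max_depth=5, class_weight='balanced')
boruta_selector = BorutaPy(rf, n_estimators='auto', max_iter=50, random_state=0)
boruta_selector.fit(train.values, target.values)
confirmed = train.columns[boruta_selector.support_].tolist()
print("Confirmed features:", confirmed)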
Exercise 4
In The Kaggle Book, we explain hyperparameter optimization (page 241 onward) and provide some key hyperparameters for the LightGBM model.
As an exercise:
Try to improve the hyperparameter search with Optuna by reducing or extending the explored parameters and their ranges where you deem it necessary, and also try alternative optimization methods, such as random search or halving search from scikit-learn (pages 245–246); a possible starting point for the halving search is sketched after the notes below.
Exercise Notes (write down any notes or workings that will help you):
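As a hedged starting point for the halving search part of the exercise (our sketch, with illustrative parameter ranges, not part of the original solution), you could use scikit-learn's HalvingRandomSearchCV with n_estimators as the budgeted resource; note that the search optimizes ROC-AUC, which ranks models in the same order as the normalized Gini coefficient:
# Hypothetical sketch for the exercise, not part of the original solution
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from scipy.stats import loguniform, randint, uniform

search = HalvingRandomSearchCV(
    estimator=lgb.LGBMClassifier(objective='binary', random_state=0),
    param_distributions={
        'learning_rate': loguniform(0.01, 1.0),
        'num_leaves': randint(3, 255),
        'min_child_samples': randint(3, 3000),
        'subsample': uniform(0.1, 0.9),
        'colsample_bytree': uniform(0.1, 0.9),
    },
    resource='n_estimators', max_resources=1500,
    scoring='roc_auc', cv=config.cv_folds,
    random_state=config.random_state)
search.fit(train, target)
print(search.best_params_)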
Once we have got our best parameters (or we simply use Jahrer's ones), we are ready to train and predict. Our strategy, as suggested by the best solution, is to train a model on each cross-validation fold and have each fold's model contribute to an average of test predictions. The following snippet produces both the test predictions and the out-of-fold predictions on the training dataset; the latter will be useful for figuring out how to ensemble the results:
preds = np.zeros(len(test))
oof = np.zeros(len(train))
metric_evaluations = list()
skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True, random_state=config.random_state)
for idx, (train_idx, valid_idx) in enumerate(skf.split(train, target)):
    print(f"CV fold {idx}")
    X_train, y_train = train.iloc[train_idx], target.iloc[train_idx]
    X_valid, y_valid = train.iloc[valid_idx], target.iloc[valid_idx]

    model = lgb.LGBMClassifier(**params,
                               n_estimators=config.n_estimators,
                               early_stopping_round=config.early_stopping_round,
                               force_row_wise=True)

    callbacks = [lgb.early_stopping(stopping_rounds=150),
                 lgb.log_evaluation(period=100, show_stdv=False)]

    model.fit(X_train, y_train,
              eval_set=[(X_valid, y_valid)],
              eval_metric=gini_lgb, callbacks=callbacks)

    metric_evaluations.append(
        model.best_score_['valid_0']['normalized_gini_coef'])

    preds += (model.predict_proba(test,
                                  num_iteration=model.best_iteration_)[:, 1]
              / skf.n_splits)
    oof[valid_idx] = model.predict_proba(X_valid,
                                         num_iteration=model.best_iteration_)[:, 1]
The model training shouldn't take too long. When it completes, you can check the normalized Gini coefficient obtained during the cross-validation procedure:
print(f"LightGBM CV normalized Gini coefficient:
{np.mean(metric_evaluations):0.3f}
({np.std(metric_evaluations):0.3f})")
The results are quite encouraging because the average score is 0.289 and the standard deviation of the values is quite small:
LightGBM CV Gini Normalized Score: 0.289 (0.015)
All that is left is to save the out-of-fold and test predictions as a submission and to verify the results on the public and private leaderboards:
submission['target'] = preds
submission.to_csv('lgb_submission.csv')
oofs = pd.DataFrame({'id': train.index, 'target': oof})
oofs.to_csv('lgb_oof.csv', index=False)
The obtained public score should be around 0.28442, with an associated private score of about 0.29121, which would have placed you around 29th position on the final leaderboard. That is quite a good result, but we still have to blend this model with a different one, a neural network.
Bagging the training set (that is, taking multiple bootstrap samples of the training data and training multiple models on them) should increase performance, although, as Michael Jahrer himself noted in his post, not by much.
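If you want to try it, the following is a minimal sketch of that kind of bagging (the number of bags is an arbitrary choice of ours); it averages the test predictions of models trained on bootstrap resamples of the training data, and the result could then be blended with the cross-validation predictions above:
# Hedged sketch of bagging by bootstrap resampling
n_bags = 5
bagged_preds = np.zeros(len(test))
rng = np.random.default_rng(config.random_state)
for bag in range(n_bags):
    bootstrap_idx = rng.choice(len(train), size=len(train), replace=True)
    X_boot = train.iloc[bootstrap_idx]
    y_boot = target.iloc[bootstrap_idx]
    bag_model = lgb.LGBMClassifier(**params, n_estimators=config.n_estimators,
                                   force_row_wise=True)
    bag_model.fit(X_boot, y_boot)
    bagged_preds += bag_model.predict_proba(test)[:, 1] / n_bags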