Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
The Kaggle Workbook

You're reading from   The Kaggle Workbook Self-learning exercises and valuable insights for Kaggle data science competitions

Arrow left icon
Product type Paperback
Published in Feb 2023
Publisher Packt
ISBN-13 9781804611210
Length 172 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Authors (2):
Arrow left icon
Luca Massaron Luca Massaron
Author Profile Icon Luca Massaron
Luca Massaron
Konrad Banachewicz Konrad Banachewicz
Author Profile Icon Konrad Banachewicz
Konrad Banachewicz
Arrow right icon
View More author details
Toc

Understanding the competition and the data

Porto Seguro is the third largest insurance company in Brazil (it operates in Brazil and Uruguay), offering car insurance coverage as well as many other insurance products, having used analytical methods and machine learning for the past 20 years to tailor their prices and make auto insurance coverage more accessible to more drivers. To explore new ways to achieve their task, they sponsored a competition (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction), expecting Kagglers to come up with new and better methods of solving some of their core analytical problems.

The competition is aimed at having Kagglers build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year, which is a quite common kind of task (the sponsor mentions it as a “classical challenge for insurance”). This kind of information about the probability of filing a claim can be quite precious for an insurance company. Without such a model, insurance companies may only charge a flat premium to customers irrespective of their risk, or, if they have a poorly performing model, they may charge a mismatched premium to them. Inaccuracies in profiling the customers’ risk can therefore result in charging a higher insurance cost to good drivers and reducing the price for the bad ones. The impact on the company would be two-fold: good drivers will look elsewhere for their insurance and the company’s portfolio will be overweighed with bad ones (technically, the company would have a bad loss ratio: https://www.investopedia.com/terms/l/loss-ratio.asp). Instead, if the company can correctly estimate the claim likelihood, they can ask for a fair price from their customers, thus increasing their market share, having more satisfied customers and a more balanced customer portfolio (better loss ratio), and managing their reserves better (the money the company sets aside for paying claims).

To do so, the sponsor provided training and test datasets, and the competition was ideal for anyone since the dataset was not very large and was very well prepared.

As stated on the page of the competition devoted to presenting the data (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/data):

Features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc).

In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.

The data preparation for the competition was carefully conducted to avoid any leak of information, and although secrecy has been maintained about the meaning of the features, it is quite clear that the different used tags refer to specific kinds of features commonly used in motor insurance modeling:

  • ind refers to “individual characteristics”
  • car refers to “car characteristics”
  • calc refers to “calculated features”
  • reg refers to “regional/geographic features”

As for the individual features, there was much speculation about their meaning during the competition. See for instance:

In spite of all these and more efforts, in the end the meaning of most of the features has remained a mystery up until now.

The interesting facts about this competition are that:

  1. The data is real-world, though the features are anonymous.
  2. The data is very well prepared, without leakages of any sort (no magic features here – a magic feature is a feature that by skillful processing can provide high predictive power to your models in a Kaggle competition).
  3. The test dataset not only holds the same categorical levels as the training dataset; it also seems to be from the same distribution, although Yuya Yamamoto argues that preprocessing the data with t-SNE leads to a failing adversarial validation test (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44784).

Exercise 1

As a first exercise, referring to the contents and the code in The Kaggle Book related to adversarial validation (starting from page 179), prove that the training and test data most probably originated from the same data distribution.

Exercise Notes (write down any notes or workings that will help you):

An interesting post by Tilii (Mensur Dlakic, Associate Professor at Montana State University: https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/42197) demonstrates using t-SNE that “there are many people who are very similar in terms of their insurance parameters, yet some of them will file a claim and others will not.” What Tilii mentions is quite typical of what happens in insurance, where for certain priors (insurance parameters) there is the same probability of something happening, but that event will happen or not based on how long we observe the sequence of events.

Take, for instance, IoT and telematic data in insurance. It is quite common to analyze a driver’s behavior to predict if they will file a claim in the future. If your observation period is too short (for instance, one year, as in the case of this competition), it may happen that even very bad drivers won’t have a claim because there is a low probability that such an event will occur in a short period of time, even for a bad driver. Similar ideas are discussed by Andy Harless (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/42735), who argues instead that the real task of the competition is to guess "the value of a latent continuous variable that determines which drivers are more likely to have accidents" because actually "making a claim is not a characteristic of a driver; it’s a result of chance."

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime