Usually, only a small percentage of the people who see an advertisement click on it. In other words, the positive class in such a dataset can make up just 1% of the samples, or even less. This makes it hard to predict the click-through rate (CTR), since the training data is highly imbalanced. In this section, we are going to use a highly imbalanced dataset from the Knowledge Discovery in Databases (KDD) Cup.
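To make the difficulty concrete, here is a minimal sketch of what a CTR of about 1% does to a naive evaluation. The labels are simulated rather than taken from the KDD Cup data, and the helper names (`simulate_clicks`, `accuracy`, `recall`), the 100,000 impressions, and the 1% rate are all assumptions made for illustration. The sketch scores a baseline that never predicts a click:

```python
import random


def simulate_clicks(n_impressions, ctr, seed=0):
    """Return a list of 0/1 labels where roughly `ctr` of them are clicks (1)."""
    rng = random.Random(seed)
    return [1 if rng.random() < ctr else 0 for _ in range(n_impressions)]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual clicks that were predicted as clicks."""
    positives = sum(y_true)
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_positives / positives if positives else 0.0


# Hypothetical data: 100,000 ad impressions with an assumed 1% CTR.
y_true = simulate_clicks(n_impressions=100_000, ctr=0.01)

# A trivial baseline that never predicts a click.
y_pred = [0] * len(y_true)

print(f"Observed CTR:      {sum(y_true) / len(y_true):.3%}")
print(f"Baseline accuracy: {accuracy(y_true, y_pred):.3%}")
print(f"Baseline recall:   {recall(y_true, y_pred):.3%}")
```

Under these assumptions, the baseline is right about almost every impression and still finds none of the clicks. Its accuracy is simply one minus the observed CTR, which is why accuracy alone says little about a model trained on data like this.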
The KDD Cup is an annual competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining. In 2012, the organizers released a dataset of the advertisements shown alongside the results of a search engine. The competitors' aim was to predict whether a user would click on each ad. A modified version of the data has been published on the OpenML platform (https://www.openml.org/d/1220). The CTR in the modified dataset is 16.8%. The clicked ads are our positive class. We can also call it the minority class since...