Developing a fraud analytics model
Before we start, we need to do two things: get to know the dataset and prepare our programming environment.
Description of the dataset and using linear models
For this project, we will use the credit card fraud detection dataset from Kaggle, which can be downloaded from https://www.kaggle.com/dalpozz/creditcardfraud. Because we are using this dataset, it is appropriate to cite the following publication:
- Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, and Gianluca Bontempi, Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
The dataset contains credit card transactions made by European cardholders in September 2013 over a span of only two days. There are 284,807 transactions in total, of which only 492 are frauds, meaning the dataset is highly imbalanced and the positive class (fraud) accounts...
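To make the imbalance concrete, the following minimal sketch computes each class's share from the counts quoted above (492 frauds among 284,807 transactions). The helper name `class_shares` and the encoding of fraud as `1` and legitimate transactions as `0` are assumptions made for illustration. The label list is rebuilt from the counts instead of being read from the downloaded file.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's share of the labels as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Counts quoted in the text: 492 frauds among 284,807 transactions.
N_TOTAL = 284_807
N_FRAUD = 492

# Rebuild a label list with the same class counts.
# Assumed encoding for this sketch: 1 = fraud, 0 = legitimate.
labels = [1] * N_FRAUD + [0] * (N_TOTAL - N_FRAUD)

shares = class_shares(labels)
print(f"fraud share:      {shares[1]:.4%}")
print(f"legitimate share: {shares[0]:.4%}")
```

With a split this lopsided, a model that labels every transaction as legitimate would still score very high on plain accuracy, which is why the imbalance has to be kept in mind when evaluating any classifier trained on this data.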