Building a fraud detection model
For this project, we are going to use the credit card fraud dataset from Kaggle (https://www.kaggle.com/mlg-ulb/creditcardfraud), which accompanies the following paper: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015. The dataset consists of credit card transactions made over two days by European cardholders. It is highly imbalanced: of approximately 284,000 transactions, only 492 are fraudulent (0.172% of the total).
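To make the imbalance concrete, the following minimal sketch counts the labels and reports the share of fraud cases. It uses only the Python standard library. The helper names class_balance and load_labels are hypothetical, and the label column name "Class" and a local copy of the CSV file are assumptions about how the downloaded data is laid out, not something stated above.

```python
import csv
from collections import Counter


def class_balance(labels):
    """Return (counts, fraud_fraction) for an iterable of 0/1 labels."""
    # Labels may arrive as strings when read from a CSV file.
    counts = Counter(int(float(label)) for label in labels)
    total = sum(counts.values())
    fraud_fraction = counts[1] / total if total else 0.0
    return counts, fraud_fraction


def load_labels(csv_path, label_column="Class"):
    """Read only the label column from a CSV file with a header row.

    Assumption: the label column is named "Class" and holds 0 (genuine)
    or 1 (fraud). Adjust label_column if your copy differs.
    """
    with open(csv_path, newline="") as f:
        return [row[label_column] for row in csv.DictReader(f)]


# Toy illustration with made-up labels (not the real dataset):
toy_labels = [0] * 997 + [1] * 3
counts, fraud_fraction = class_balance(toy_labels)
print(f"{counts[1]} fraud out of {sum(counts.values())} ({fraud_fraction:.3%})")
```

On the real file, a call such as class_balance(load_labels("creditcard.csv")) would be the equivalent check, where the filename is again an assumption. With a fraud share this small, a classifier that always predicts "not fraud" would still score above 99% accuracy, so it is worth confirming the class balance before training.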
There are 31 numerical columns in the dataset. Two of them are Time and Amount. Time is the number of seconds elapsed between each transaction and the first transaction in the dataset. Amount is the transaction amount. For our model, we will drop the Time column, as it doesn't improve the accuracy of the model. The rest of the features (V1, V2...