The data
The dataset we will work with is a synthetic set of transactions generated by a payment simulator. The goal of this case study, and the focus of this chapter, is to find fraudulent transactions within this dataset, a classic machine learning problem that many financial institutions deal with.
Note
Before we go further, a digital copy of the code and an interactive notebook for this chapter are accessible online via the following two links:
An interactive notebook containing the code for this chapter can be found under https://www.kaggle.com/jannesklaas/structured-data-code
The code can also be found on GitHub, in this book's repository: https://github.com/PacktPublishing/Machine-Learning-for-Finance
The dataset we're using comes from the paper "PaySim: A financial mobile money simulator for fraud detection" by E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. It is available on Kaggle at https://www.kaggle.com/ntnu-testimon/paysim1.
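As a rough first check once a transactions file has been downloaded, it can help to count how many rows are labeled as fraud. The following is a minimal sketch, not the chapter's own code: it uses only Python's standard csv module, it assumes the label column is named isFraud and holds "1" for fraudulent rows, and the rows in the sample are made up purely for illustration rather than taken from the dataset.

```python
import csv
import io


def fraud_share(handle, label_col="isFraud"):
    """Stream a CSV and return (total_rows, fraud_rows, fraud_fraction).

    `label_col` is an assumed column name; change it if the file's
    header uses a different label.
    """
    total = 0
    flagged = 0
    # DictReader yields one dict per row, so the file is never held
    # in memory all at once.
    for row in csv.DictReader(handle):
        total += 1
        if row[label_col] == "1":
            flagged += 1
    fraction = flagged / total if total else 0.0
    return total, flagged, fraction


# Toy, made-up rows for illustration only (not real dataset values).
sample = io.StringIO(
    "amount,isFraud\n"
    "100.0,0\n"
    "250.0,1\n"
    "75.5,0\n"
    "20.0,0\n"
)
total, flagged, fraction = fraud_share(sample)
print(total, flagged, fraction)
```

To run the same check against a downloaded file, open it with open(path, newline="") and pass the file object in place of the in-memory sample. Streaming row by row keeps memory use flat even for large files.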
Before we break it down...