In this chapter, we are going to develop an algorithm based on the Gaussian Distribution function using Spark ML. We will apply the algorithm to detect fraud in transaction data. This kind of algorithm can be applied toward building robust fraud detection solutions for financial institutions, such as banks, which handle large volumes of online transactions.
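To make the idea concrete before going further, here is a minimal sketch in plain Scala, without Spark, of how a Gaussian density can act as an anomaly score. It assumes a single numeric feature and a hand-picked threshold. The names `GaussianSketch`, `isAnomaly`, and `epsilon` are hypothetical and are used only for illustration; the algorithm developed in this chapter may differ in its details.

```scala
// Minimal sketch, not the chapter's implementation: a univariate Gaussian
// density used as an anomaly score. All names here are hypothetical.
object GaussianSketch {

  // Sample mean of the observations.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population variance (divides by n) around a given mean.
  def variance(xs: Seq[Double], mu: Double): Double =
    xs.map(x => math.pow(x - mu, 2)).sum / xs.size

  // Gaussian probability density at x for mean mu and variance sigma2.
  def density(x: Double, mu: Double, sigma2: Double): Double =
    math.exp(-math.pow(x - mu, 2) / (2 * sigma2)) / math.sqrt(2 * math.Pi * sigma2)

  // A point is flagged when its density falls below the assumed threshold epsilon.
  def isAnomaly(x: Double, mu: Double, sigma2: Double, epsilon: Double): Boolean =
    density(x, mu, sigma2) < epsilon
}
```

For example, fitting `mean` and `variance` on transaction amounts of 20, 22, 19, 21, and 18 and then calling `isAnomaly` with an amount of 500 and an `epsilon` of 0.001 flags that amount, while an amount of 21 is not flagged. In practice, the threshold would be chosen from labeled data rather than by hand.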
At the heart of the Gaussian Distribution function is the notion of an anomaly. The fraud detection problem is a classification task, but only in a very narrow sense. It is an imbalanced supervised learning problem. The term imbalanced refers to the fact that the positives in the dataset are few in number relative to the negatives. An anomaly detection problem is typically even less balanced. The dataset contains a significantly small number of anomalies (positives...