Generating a synthetic dataset for anomaly detection experiments
In this recipe, we will generate a synthetic dataset that contains outliers, or anomalies. This will enable us to perform anomaly detection experiments using algorithms such as Random Cut Forest (RCF). If this is your first time hearing about anomaly detection, it refers to the identification of outliers, that is, records that differ significantly from the rest of the records in the dataset. What's the RCF algorithm? It is an unsupervised algorithm used for detecting these anomalies in a dataset.
After we have generated the synthetic dataset in this recipe, we will use it to train and deploy an RCF model, and then invoke this model from within an Amazon Athena query in the Invoking machine learning models with Amazon Athena using SQL queries recipe. This will enable us to tag anomalies in our dataset during the data preparation and analysis phase.
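To make the idea concrete before walking through the steps, here is a minimal sketch of what a synthetic dataset with injected anomalies could look like. It is not the recipe's actual code: the function name generate_synthetic_dataset, the distribution parameters, and the record layout are all assumptions chosen for illustration. Normal records are drawn from a Gaussian distribution, and a small number of anomalies are drawn from a range far away from that cluster.

```python
import random


def generate_synthetic_dataset(n_normal=1000, n_anomalies=20, seed=42):
    """Return a shuffled list of (value, is_anomaly) records.

    Hypothetical helper for illustration; all parameters are assumed values.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible

    # Normal records: Gaussian centered at 50 with a standard deviation of 5.
    records = [(rng.gauss(50.0, 5.0), 0) for _ in range(n_normal)]

    # Anomalies: values drawn uniformly from a range far above the cluster.
    records += [(rng.uniform(100.0, 150.0), 1) for _ in range(n_anomalies)]

    # Shuffle so the anomalies are scattered throughout the dataset.
    rng.shuffle(records)
    return records


dataset = generate_synthetic_dataset()
anomalies = [value for value, label in dataset if label == 1]
print(f"{len(dataset)} records, {len(anomalies)} labeled as anomalies")
```

The label column is kept only so that we can later check how well a detector does. An unsupervised algorithm such as RCF would be trained on the values alone, without the labels.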
Tip
Since we will show the steps on how to...