Defining anomaly detection
Let's start by establishing what anomaly detection is. Also called outlier detection, anomaly detection is the process of identifying rare observations in a dataset. Those rare observations are called outliers or anomalies.
The goal of anomaly detection is to build models that can automatically detect outliers using statistical methods, machine learning, or both. Such models can use multiple variables to decide whether an observation should be considered an outlier.
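As a rough illustration of a purely statistical detector that looks at several variables at once, here is a minimal sketch based on z-scores. The function name `zscore_outliers`, the threshold of 3 standard deviations, and the synthetic data are assumptions made for this example only; they are not a method prescribed by this chapter.

```python
import numpy as np

def zscore_outliers(data, threshold=3.0):
    """Return a boolean mask marking rows where any variable has |z-score| > threshold.

    Hypothetical helper for illustration; the threshold of 3 is a common
    rule of thumb, not a universal setting.
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    # Avoid division by zero for constant columns (their z-scores become 0).
    std = np.where(std == 0, 1.0, std)
    z = np.abs((data - mean) / std)
    # A row is flagged if at least one of its variables is extreme.
    return (z > threshold).any(axis=1)

# Synthetic data: 200 observations of 2 variables drawn from a standard normal.
rng = np.random.default_rng(seed=0)
observations = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
observations[0] = [8.0, 0.0]  # inject one obviously unusual row

mask = zscore_outliers(observations)
print("flagged rows:", np.flatnonzero(mask))
```

This sketch treats each variable independently, so it cannot catch observations that are unusual only in combination; it is meant to show the idea of an automatic rule, not a complete detector.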
Are outliers a problem?
Outliers occur in many datasets. After all, even for a variable that follows a normal distribution, some data points are expected to lie far from the mean. Let's consider a standard normal distribution (a normal distribution with mean 0 and standard deviation 1):
Code Block 5-1
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
x = ...