Rather than jumping straight into the algorithms available in scikit-learn, let's start by thinking about ways to detect anomalous samples. Imagine measuring the traffic to your website every hour, which gives you the following numbers:
hourly_traffic = [
    120, 123, 124, 119, 196,
    121, 118, 117, 500, 132,
]
Looking at these numbers, 500 stands out as much higher than the rest. More formally, if we assume the hourly traffic is normally distributed, then 500 lies far from the mean, or expected value, of the data. We can quantify this by calculating the mean and standard deviation of these numbers and then flagging the values that are more than 2 or 3 standard deviations away from the mean. Similarly, we can calculate a high quantile and check which numbers are above it. Here, we find the values above the 95th percentile:
import pandas as pd
pd.Series(hourly_traffic) > pd.Series(hourly_traffic).quantile(0.95)
This code will give...
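Separately from the quantile check, the standard-deviation rule described earlier can be sketched in the same way. This is a minimal illustration rather than a definitive recipe: the names `traffic`, `z_scores`, `threshold`, and `outliers` are introduced here for the example, and the cutoff of 2 is an assumed choice.

```python
import pandas as pd

hourly_traffic = [
    120, 123, 124, 119, 196,
    121, 118, 117, 500, 132,
]

traffic = pd.Series(hourly_traffic)

# z-score: distance from the mean, measured in standard deviations.
# pandas' std() computes the sample standard deviation (ddof=1) by default.
z_scores = (traffic - traffic.mean()) / traffic.std()

# Assumed cutoff for illustration; 2 and 3 are both common choices.
threshold = 2
outliers = traffic[z_scores.abs() > threshold]
print(outliers)
```

With a cutoff of 2, only 500 is flagged. With a cutoff of 3, nothing would be flagged on this sample, because the extreme value itself inflates the standard deviation. This is one reason the choice of threshold deserves some care on small datasets.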