Implementing rule-based anomaly detection
It's time to catch those hackers. After the EDA in the previous section, we have an idea of how we might go about this. In practice, this is much more difficult to do, as it involves many more dimensions, but we have simplified it here. We want to find the IP addresses with excessive amounts of attempts accompanied by low success rates, and those attempting to log in with more unique usernames than we would deem normal (anomalies). To do this, we will employ threshold-based rules as our first foray into anomaly detection; then, in Chapter 11, Machine Learning Anomaly Detection, we will explore a few machine learning techniques as we revisit this scenario.
Since we are interested in flagging IP addresses that are suspicious, we are going to arrange the data so that we have hourly aggregated data per IP address (if there was activity for that hour):
>>> hourly_ip_logs = log.assign( ... failures...