Implementing supervised anomaly detection
The SOC has finished labeling the 2018 data, so we should revisit our EDA to make sure that our plan of looking at the number of usernames with failures at a minute resolution does separate the data. This EDA is in the 3-EDA_labeled_data.ipynb notebook. After some data wrangling, we are able to create the following scatter plot, which shows that this strategy does indeed appear to separate the suspicious activity:
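To make the per-minute aggregation concrete, here is a minimal sketch of counting distinct usernames with failed logins in each minute. It is not the notebook's code: the sample records, the column names (datetime, username, success), and the helper usernames_with_failures_per_minute() are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical log records; the column names are assumptions, not the book's schema
logs = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2018-01-01 00:00:05', '2018-01-01 00:00:20', '2018-01-01 00:00:40',
        '2018-01-01 00:01:10', '2018-01-01 00:01:30', '2018-01-01 00:03:15',
    ]),
    'username': ['alice', 'bob', 'alice', 'carol', 'carol', 'dave'],
    'success': [False, False, True, False, False, True],
})

def usernames_with_failures_per_minute(logs):
    """Count the distinct usernames with at least one failed attempt per minute."""
    failures = logs[~logs['success']]  # keep only the failed attempts
    return (
        failures.set_index('datetime')['username']
        .resample('1min')  # bucket the failures into one-minute windows
        .nunique()         # distinct usernames within each window
        .rename('usernames_with_failures')
    )

counts = usernames_with_failures_per_minute(logs)
print(counts)
```

A series like this could then be joined with the SOC's labels and plotted to check whether the minutes flagged as suspicious stand apart from the rest.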
In the 4-supervised_anomaly_detection.ipynb notebook, we will create some supervised models. This time, we need to read in all the labeled data for 2018. Note that the code for reading in the logs is omitted since it is the same as in the previous section:
>>> with sqlite3.connect('logs/logs.db') as conn:
...     hackers_2018 = pd.read_sql(
...      ...