With our additional data, we should revisit our EDA to make sure that our plan of counting the usernames with failures at a one-minute resolution does separate the data. After some data wrangling in the 3-EDA_labeled_data.ipynb notebook, we are able to create a scatter plot, which shows that this strategy does indeed appear to separate the suspicious activity.
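To make the per-minute feature concrete, here is a minimal sketch of how the count of distinct usernames with failures per minute might be computed. The toy log, its column names (username and success), and the helper name usernames_with_failures_per_minute() are assumptions made for illustration; they are not taken from the notebook.

```python
import pandas as pd

# Hypothetical toy log; the column names and values are assumptions for illustration
log = pd.DataFrame(
    {
        'username': ['alice', 'bob', 'alice', 'carol', 'dave', 'alice'],
        'success': [False, False, False, True, False, False],
    },
    index=pd.to_datetime([
        '2019-01-01 00:00:05',
        '2019-01-01 00:00:20',
        '2019-01-01 00:00:40',
        '2019-01-01 00:00:50',
        '2019-01-01 00:01:10',
        '2019-01-01 00:02:30',
    ]),
)


def usernames_with_failures_per_minute(log):
    """Count the distinct usernames with a failed attempt in each minute."""
    failures = log[~log['success']]  # keep only the failed attempts
    return (
        failures['username']
        .resample('1min')  # one bin per minute, based on the datetime index
        .nunique()  # distinct usernames in each bin
        .rename('usernames_with_failures')
    )


per_minute = usernames_with_failures_per_minute(log)
print(per_minute)
```

Counting distinct usernames rather than raw failures means that one user mistyping a password several times counts once, while an attacker cycling through many usernames in the same minute stands out.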
In the 4-supervised_anomaly_detection.ipynb notebook, we will create some supervised models. Before we build them, however, let's write a new function that returns both X and y in a single call. The get_X_y() function uses the get_X() and get_y() functions we made earlier:
def get_X_y(log, day, hackers):
    """
    Get the X, y data to build a model with.

    Parameters:
        - log: The logs dataframe
        - day: A day or single...