Misleading results – data leakage
During training, the model sees one set of data; during testing, it is evaluated on another. A sound evaluation requires that these two datasets are disjoint. If they are not, we run into what is called a data leakage problem: the same data points appear in both the training and test sets, so the test score no longer measures performance on unseen data. Let's illustrate this with an example.
First, we need to create a new, leaky split in which some data points appear in both sets. The split function always produces disjoint sets, so we can first place 20% of the data points in the test set and then copy half of those test points back into the training set. This means that 10% of the data points end up in both sets:
X_trainL, X_testL, y_trainL, y_testL = \
    sklearn.model_selection.train_test_split(X, y, random_state=42, train_size=0.8)
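Because train_test_split returns disjoint sets, the overlap has to be introduced explicitly. A minimal self-contained sketch of the idea, using plain Python lists in place of the dataset's X and y (the helper leaky_split is a hypothetical name, not part of scikit-learn):

```python
def leaky_split(X, y, train_frac=0.8, leak_frac=0.5):
    """Split X/y, then copy a fraction of the test points back into
    the training set to simulate a data leak."""
    n_train = int(len(X) * train_frac)
    X_train, X_test = X[:n_train], X[n_train:]
    y_train, y_test = y[:n_train], y[n_train:]
    n_leak = int(len(X_test) * leak_frac)  # points present in both sets
    return X_train + X_test[:n_leak], X_test, y_train + y_test[:n_leak], y_test

# toy data: 10 distinct points
X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = leaky_split(X, y)
shared = [p for p in X_te if p in X_tr]
print(len(shared))  # 1 -> 10% of the 10 data points are in both sets
```

With real NumPy arrays, the same copy-back step would use np.concatenate on the split returned by train_test_split.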
Now, we can use the same code to make predictions on this data and then calculate the performance metrics:
# now, let's evaluate the model on this new data with torch...
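The evaluation code itself is elided above; to see why the leak inflates the metrics, here is a self-contained stand-in that uses a hypothetical model which simply memorizes its training data (not the book's actual torch model):

```python
# A model that memorizes its training data answers every leaked test
# point perfectly, so accuracy on a leaky test set is inflated.
def fit_memorizer(X, y):
    """Hypothetical stand-in model: a lookup table of training points."""
    return dict(zip(map(tuple, X), y))

def accuracy(model, X, y, default=None):
    preds = [model.get(tuple(x), default) for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

X = [(i,) for i in range(10)]
y = [i % 2 for i in range(10)]

# clean split: the memorizer has never seen the test points
X_tr, y_tr, X_te, y_te = X[:8], y[:8], X[8:], y[8:]
print(accuracy(fit_memorizer(X_tr, y_tr), X_te, y_te))  # 0.0

# leaky split: every test point is also in the training set
print(accuracy(fit_memorizer(X, y), X_te, y_te))  # 1.0
```

A real model does not memorize this perfectly, but the direction of the bias is the same: every leaked point pushes the test score upward, which is exactly why leaky results are misleading.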