At a high level, splitting the dataset into training and testing data to obtain a principled estimate of the system's performance works as in previous chapters: we take a certain fraction of our data points (we will use 10 percent) and reserve them for testing; the rest is used for training.
However, because the data is structured differently in this context, the code is different. In some of the models we explore, setting aside 10 percent of the users would not work, so the split has to be performed on the ratings themselves.
The first step is to load the data from the disk, for which we use the following function:
def load():
    import numpy as np
    from scipy import sparse

    data = np.loadtxt('data/ml-100k/u.data')
    ij = data[:, :2]
    ij -= 1  # original data is in a 1-based system
    values = data[:, 2]
    # build a dense users-by-movies matrix; zero entries mean "no rating"
    reviews = sparse.csc_matrix((values, ij.T)).astype(float)
    return reviews.toarray()
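Assuming the u.data file from the MovieLens ml-100k archive is in the path used above, a quick sanity check of the returned matrix might look like this (the printed values reflect the known size of ml-100k: 943 users, 1,682 movies, 100,000 ratings):

reviews = load()
print(reviews.shape)        # (943, 1682): one row per user, one column per movie
print((reviews > 0).sum())  # 100000 known ratings; zeros mark missing ratings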
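As noted above, rather than holding out whole users, a split that works for all the models we will look at keeps every user in the training data but hides a fraction of their ratings for testing. The following is a minimal sketch of that idea, not the exact split used later in the chapter; the function name split_ratings, the fixed random seed, and the "at least one held-out rating per user" rule are illustrative assumptions, while the 10 percent fraction matches the text:

import numpy as np

def split_ratings(reviews, fraction=0.1, seed=0):
    # Hide `fraction` of each user's known ratings by zeroing them in the
    # training copy; the held-out positions and values form the test set.
    rng = np.random.RandomState(seed)
    train = reviews.copy()
    test = np.zeros_like(reviews)
    for u in range(reviews.shape[0]):
        rated = np.where(reviews[u] > 0)[0]          # movies this user rated
        n_test = max(1, int(fraction * len(rated)))  # hold out at least one
        held_out = rng.choice(rated, n_test, replace=False)
        train[u, held_out] = 0                       # hidden from training
        test[u, held_out] = reviews[u, held_out]     # kept for evaluation
    return train, test

train, test = split_ratings(load())

Zeroing out the held-out entries keeps the training matrix the same shape as the full one, so every user is still present during training, which is what the models discussed here require.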