As we have seen, it is best to normalize the data to remove obvious movie- or user-specific effects. We will use just one very simple type of normalization, which we have used before: conversion to z-scores.
Unfortunately, we cannot simply use scikit-learn's normalization objects, because we have to deal with the missing values in our data (that is, not all movies were rated by all users). Thus, we want to normalize by the mean and standard deviation of only those values that are, in fact, present.
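To make the idea concrete before building the class, here is a minimal sketch of z-scoring over only the values that are present. It is an illustration under stated assumptions, not the implementation developed in this section: the function name zscore_ignore_missing is hypothetical, the toy ratings array is made up, and missing ratings are assumed to be encoded as NaN (the actual data may mark them differently).

```python
import numpy as np

def zscore_ignore_missing(ratings, axis=0):
    """Convert to z-scores along `axis`, skipping missing (NaN) entries.

    Hypothetical helper for illustration; assumes missing values are NaN.
    """
    ratings = np.asarray(ratings, dtype=float)
    # nanmean/nanstd compute statistics over the non-NaN entries only
    mean = np.nanmean(ratings, axis=axis, keepdims=True)
    std = np.nanstd(ratings, axis=axis, keepdims=True)
    # Guard against division by zero when all present values are equal
    std[std == 0] = 1.0
    # NaN entries stay NaN, so missing ratings remain marked as missing
    return (ratings - mean) / std

# Toy data: rows and columns are arbitrary here; NaN marks a missing rating
ratings = np.array([
    [5.0, np.nan, 1.0],
    [3.0, 4.0, np.nan],
    [1.0, 2.0, 3.0],
])
normalized = zscore_ignore_missing(ratings, axis=0)
```

After this call, each column of normalized has mean zero over its present entries, and the missing entries are still NaN. Wrapping the same logic in a class lets us store the mean and standard deviation at fit time and reuse them later.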
We will write our own class that will ignore missing values. This class will follow the scikit-learn preprocessing API. We can even derive from scikit-learn's TransformerMixin class to add a fit_transform method:
from sklearn.base import TransformerMixin

class NormalizePositive(TransformerMixin):
We want to choose the axis of normalization. By default, we normalize...