In machine learning, a regularization term, R(P), computed over the parameters, P, of the method, f, to optimize (for instance, a neural network) can be added to the loss function, L, before training, as follows:

$$L_{\text{reg}} = L(y, y^{\text{true}}) + \lambda R(P)$$

Here, λ is a factor controlling the strength of the regularization (typically set to scale down the amplitude of the regularization term compared to the main loss), and y = f(x, P) is the output of the method, f, parametrized by P, for the input data, x. By adding this term, R(P), to the loss, we force the network not only to optimize for its task, but to do so while constraining the values its parameters can take.
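A minimal sketch of how such a regularized loss could be assembled (the NumPy-based layout, the l2_penalty helper, and the lam value are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

def l2_penalty(params):
    # R(P): here, the sum of squared parameter values (L2 regularization)
    return sum(np.sum(p ** 2) for p in params)

def regularized_loss(task_loss, params, lam=0.01):
    # Full objective: main loss plus the scaled regularization term
    return task_loss + lam * l2_penalty(params)

# Illustrative parameters of a tiny two-layer network
rng = np.random.default_rng(seed=0)
params = [rng.standard_normal((4, 8)), rng.standard_normal((8, 2))]

task_loss = 0.35  # e.g., a cross-entropy value computed elsewhere
print(regularized_loss(task_loss, params))
```

Most deep learning frameworks provide built-in regularizers that add this penalty to the loss automatically, so the sum above rarely has to be written by hand.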
For L1 and L2 regularization, the respective terms are as follows:

$$R_{L1}(P) = \sum_i |p_i| \qquad R_{L2}(P) = \sum_i p_i^2$$

where the p_i are the individual values in P.
L2 regularization (also called ridge regularization) thus compels the network to minimize the sum of its squared parameter values. While this regularization leads to the decay of all parameter values over the optimization process, it more strongly penalizes the larger parameters, since the squared term grows quadratically; this pushes the network to keep its weights small and spread across all parameters, rather than concentrated in a few large ones.
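To make this asymmetry concrete, the following sketch (with arbitrarily chosen parameter values) compares the two penalties and the per-parameter gradients they contribute during optimization:

```python
import numpy as np

p = np.array([0.1, 1.0, 10.0])  # three parameters of very different magnitudes

# Penalty values over the whole vector
print(np.sum(np.abs(p)))  # L1 penalty: 11.1
print(np.sum(p ** 2))     # L2 penalty: 101.01, dominated by the largest weight

# Gradient each penalty contributes to the update of a single parameter p_i:
#   d|p_i|/dp_i   = sign(p_i)  (L1: constant pull toward zero, whatever the size)
#   d(p_i^2)/dp_i = 2 * p_i    (L2: pull proportional to the parameter's magnitude)
print(np.sign(p))  # L1 gradients: [1. 1. 1.]
print(2 * p)       # L2 gradients: [0.2 2. 20.]; the largest weight decays fastest
```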