Batch normalization
Let's consider a mini-batch of k samples:
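$$B = \{x_1, x_2, \ldots, x_k\}$$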
Before the batch traverses the network, we can compute its mean and variance:
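$$\mu_B = \frac{1}{k}\sum_{i=1}^{k} x_i \qquad \sigma_B^2 = \frac{1}{k}\sum_{i=1}^{k} \left(x_i - \mu_B\right)^2$$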
After the first layer (for simplicity, let's suppose that the activation function, f(•), is always the same), the batch is transformed into the following:
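$$B^{(1)} = \left\{f\left(W^{(1)} x_1 + b^{(1)}\right), \ldots, f\left(W^{(1)} x_k + b^{(1)}\right)\right\}$$

where W^(1) and b^(1) denote the weight matrix and bias vector of the first layer, so the transformed batch has a new mean, μ_B^(1), and a new variance, σ_B^(1)².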
In general, there's no guarantee that the new mean and variance match the original ones. On the contrary, it's easy to observe a shift in these statistics that grows as the batch propagates through the network. This phenomenon is called internal covariate shift, and it progressively slows down training because each layer must keep adapting to the changing distribution of its inputs. Ioffe and Szegedy (in Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe S., Szegedy C., arXiv:1502.03167 [cs.LG]) proposed a method to mitigate this problem, which has been called batch normalization (BN).
The idea is to renormalize the linear output of a layer (before or after applying the activation function), so that the batch has null mean and unit variance.
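For each sample in the batch, BN applies the transformation defined in the aforementioned paper:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad y_i = \gamma \hat{x}_i + \beta$$

Here, ε is a small constant added for numerical stability, while γ and β are learnable parameters that allow the network to rescale and shift the normalized values, restoring the representational capacity that a pure standardization would remove.

To make the mechanics concrete, the following is a minimal NumPy sketch of the forward pass (the function name batch_norm and the default value of eps are illustrative choices, not the API of any specific framework):

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    # X has shape (k, n): k samples, n features
    mu = X.mean(axis=0)                     # per-feature batch mean
    var = X.var(axis=0)                     # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)   # null mean, unit variance
    return gamma * X_hat + beta             # learnable rescale and shift

# A mini-batch of k=64 samples with n=10 features, deliberately
# shifted and scaled away from null mean and unit variance
X = np.random.randn(64, 10) * 5.0 + 3.0
Y = batch_norm(X, gamma=np.ones(10), beta=np.zeros(10))
print(Y.mean(axis=0))  # approximately 0
print(Y.std(axis=0))   # approximately 1
```

With γ = 1 and β = 0, the output is simply the standardized batch; during training, both parameters are updated by gradient descent together with the other weights.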