Batch normalization
Let's consider a mini-batch containing k data points:

$$B = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k\}$$
Before traversing the network, we can measure the sample mean and variance:

$$\mu_B = \frac{1}{k}\sum_{i=1}^{k} \bar{x}_i \qquad \sigma_B^2 = \frac{1}{k}\sum_{i=1}^{k} \left(\bar{x}_i - \mu_B\right)^2$$
After the first layer (for simplicity, let's suppose that the activation function, $f_a(x)$, is always the same), the batch is transformed into the following:

$$B_1 = \{f_a(W\bar{x}_1 + \bar{b}), f_a(W\bar{x}_2 + \bar{b}), \ldots, f_a(W\bar{x}_k + \bar{b})\}$$
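As a quick numerical check, here is a minimal NumPy sketch of this effect. The dense layer, the random weights `W` and bias `b`, and the choice of tanh as $f_a$ are illustrative assumptions, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a mini-batch of k points with d features,
# and a single dense layer with a tanh activation (f_a).
k, d = 32, 10
X = rng.normal(loc=0.0, scale=1.0, size=(k, d))   # mini-batch B
W = rng.normal(scale=0.5, size=(d, d))            # layer weights (assumed)
b = rng.normal(scale=0.1, size=d)                 # layer bias (assumed)

print("input mean/var:     ", X.mean(), X.var())

# Transformed batch B1 = {f_a(W x_i + b)}
X1 = np.tanh(X @ W + b)
print("post-layer mean/var:", X1.mean(), X1.var())
```

Running this shows that a single layer is enough to move both statistics away from their input values; stacking more layers compounds the effect.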
In general, there's no guarantee that the new mean and variance match the original ones. On the contrary, the statistics typically drift, and the drift tends to grow as the batch moves deeper through the network. This phenomenon is called internal covariate shift, and it's responsible for a progressive slowdown in training, because each layer must continually adapt to the changing distribution of its inputs. Ioffe and Szegedy (in Ioffe S., Szegedy C., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv:1502.03167 [cs.LG]) proposed a method to mitigate this problem, called batch normalization (BN).
The idea is to renormalize the linear output of a layer (before the activation function is applied) so that the batch has zero mean and unit variance. Given the batch statistics, each point is first standardized:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

The small constant $\epsilon$ (for example, $10^{-5}$) prevents division by zero. To avoid limiting the representational power of the layer, the normalized values are then scaled and shifted through two parameters, $\gamma$ and $\beta$, which are learned together with the other weights:

$$y_i = \gamma \hat{x}_i + \beta$$
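The following minimal NumPy sketch implements this forward pass in training mode. The function name `batch_norm`, the tensor shapes, and the value of `eps` are illustrative assumptions; a complete implementation would also maintain running statistics for use at inference time:

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    """Batch normalization of a mini-batch Z with shape (k, d):
    standardize each feature using the batch statistics, then apply
    the learned scale (gamma) and shift (beta)."""
    mu = Z.mean(axis=0)                    # per-feature batch mean
    var = Z.var(axis=0)                    # per-feature batch variance
    Z_hat = (Z - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * Z_hat + beta            # learned affine transformation

# Hypothetical usage on the pre-activation output of a layer:
rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=2.0, size=(32, 10))
gamma = np.ones(10)   # learnable; commonly initialized to 1
beta = np.zeros(10)   # learnable; commonly initialized to 0
Y = batch_norm(Z, gamma, beta)
print(Y.mean(axis=0).round(6))  # approximately 0 for every feature
print(Y.var(axis=0).round(6))   # approximately 1 for every feature
```

Because $\gamma$ and $\beta$ are learned, the network can even recover the identity transformation if normalization turns out to be unhelpful for a particular layer.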