Batch normalization
Let's consider a mini-batch containing k data points:
$$X_b = \{x_1, x_2, \ldots, x_k\}$$
Before traversing the network, we can measure the sample mean and variance:
$$\mu_b = \frac{1}{k} \sum_{i=1}^{k} x_i \quad \text{and} \quad \sigma_b^2 = \frac{1}{k} \sum_{i=1}^{k} (x_i - \mu_b)^2$$
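As a minimal sketch, the sample mean and variance of a mini-batch can be computed feature-wise with NumPy. The array X_b and its values are made up for illustration, and the variance is assumed to be the biased estimator (dividing by k), which is the convention used in the batch normalization paper cited in this section.

```python
import numpy as np

# Hypothetical mini-batch: k = 4 data points with 3 features each
X_b = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, 9.0],
    [4.0, 8.0, 12.0],
])

k = X_b.shape[0]

# Sample mean, computed feature-wise over the batch axis
mu_b = X_b.sum(axis=0) / k

# Sample variance (biased estimator, dividing by k)
var_b = ((X_b - mu_b) ** 2).sum(axis=0) / k

print("mean:", mu_b)
print("variance:", var_b)
```

The same values are returned by X_b.mean(axis=0) and X_b.var(axis=0), since NumPy's var divides by k by default (ddof=0).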
After the first layer (for simplicity, let's suppose that the activation function, f_a(x), is the same in every layer), the batch is transformed into the following:
$$X_b^{(1)} = \{f_a(z_1), f_a(z_2), \ldots, f_a(z_k)\}$$ where z_i is the linear output of the first layer for the data point x_i.
In general, there's no guarantee that the new mean and variance match the original ones. On the contrary, the discrepancy usually grows as the batch travels through the network, and it keeps changing during training because the parameters of the preceding layers are updated. This phenomenon is called internal covariate shift, and it progressively slows training down, because each layer has to keep adapting to a changing input distribution. Ioffe and Szegedy (in Ioffe S., Szegedy C., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv:1502.03167 [cs.LG]) proposed a method to mitigate this problem, which is called batch normalization (BN).
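The drift in the batch statistics can be observed with a small experiment. The following is only a sketch under stated assumptions: the layer sizes, the ReLU activation, and the random weights and biases are all hypothetical stand-ins for a real network. It also covers just one half of the argument, because it looks at a single set of weights. During training the weights are updated as well, so the statistics seen by each layer keep moving.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only

def relu(z):
    # Activation f_a, assumed identical in every layer
    return np.maximum(z, 0.0)

k, n_features, n_layers = 64, 10, 3

# Hypothetical mini-batch drawn from a standard normal distribution
X = rng.normal(0.0, 1.0, size=(k, n_features))

# Record (mean, variance) over all values before and after each layer
stats = [(X.mean(), X.var())]

for _ in range(n_layers):
    # Random weights and biases stand in for a layer's current parameters
    W = rng.normal(0.0, 1.0, size=(n_features, n_features))
    b = rng.normal(0.0, 1.0, size=n_features)
    X = relu(X @ W + b)
    stats.append((X.mean(), X.var()))

for depth, (mean, var) in enumerate(stats):
    print(f"depth {depth}: mean={mean:.3f}, variance={var:.3f}")
```

Running the script prints one line per depth. Comparing the lines shows that neither the mean nor the variance of the input batch is preserved from one layer to the next.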
The idea is to renormalize the linear output of a layer (before...