In the previous section on scaling the input dataset, we learned that optimization is slow when the input data is not scaled (that is, when it does not lie between zero and one).
The hidden layer values could be high in any of the following scenarios:
- Input data values are high
- Weight values are high
- The product of the weights and the inputs is high
Any of these scenarios can result in large values at the hidden layer, as the short sketch after this list illustrates.
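The following is a quick numeric sketch of this effect, using hypothetical toy arrays (no training involved): the same weights produce hidden-layer values that are hundreds of times larger when the inputs are unscaled raw values instead of values between zero and one.

```python
import numpy as np

np.random.seed(0)

# Hypothetical single-sample input and one hidden layer, purely for illustration
x_scaled = np.random.rand(1, 4)      # inputs scaled to [0, 1)
x_unscaled = x_scaled * 255          # e.g., raw pixel intensities
w = np.random.rand(4, 3)             # hidden-layer weights

# Pre-activation values of the hidden layer
print(np.abs(x_scaled @ w).max())    # small hidden-layer values
print(np.abs(x_unscaled @ w).max())  # hidden-layer values roughly 255x larger
```

The same blow-up happens if the weights (rather than the inputs) are large, since the hidden-layer value is simply the product of the two.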
Note that the hidden layer acts as the input to the output layer. Hence, the phenomenon of high input values slowing down optimization applies equally when the hidden layer values are large.
Batch normalization comes to the rescue in this scenario. We have already learned that, when the input values are high, we scale them to reduce their magnitude. Additionally, we have learned that scaling can also be performed using...
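As a rough sketch of the idea behind batch normalization (a hand-rolled NumPy helper written only for this illustration, not a library API), the hidden-layer values in a batch are normalized to zero mean and unit variance, and then rescaled and shifted by parameters gamma and beta (which are learnable in a real layer, but fixed here):

```python
import numpy as np

def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of hidden-layer values to zero mean and unit variance,
    then rescale by gamma and shift by beta (illustrative constants here)."""
    mean = h.mean(axis=0)
    var = h.var(axis=0)
    h_hat = (h - mean) / np.sqrt(var + eps)
    return gamma * h_hat + beta

# Hidden-layer values that have grown large
h = np.random.rand(32, 10) * 100
print(h.mean(), h.std())        # large mean and spread before normalization
h_bn = batch_norm(h)
print(h_bn.mean(), h_bn.std())  # approximately 0 mean and unit variance after
```

In this way, the hidden-layer values stay in a small, well-behaved range regardless of how large the inputs or weights become, which keeps optimization fast.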