As the ResNet authors pointed out, the degradation phenomenon would not occur if layers could easily learn the identity mapping (that is, if a set of layers could learn weights such that their series of operations ultimately returns tensors identical to their input).
Indeed, the authors argue that, when adding layers on top of a CNN, we should obtain at least the same training/validation errors if these additional layers were able to converge to the identity function: they would simply learn to pass along the result of the original network without degrading it. Since that is not the case (a degradation is often observed), identity mapping must not be easy for CNN layers to learn.
This led to the idea of introducing residual blocks, with two paths (see the sketch after this list):
- One path further processes the data with some additional convolutional layers
- One path performs the identity mapping (that is, forwarding the data with no changes)
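Below is a minimal sketch of such a residual block, assuming TensorFlow/Keras as the framework and assuming the input already has `filters` channels so the identity shortcut can be added directly to the convolutional path; the function name and parameters are illustrative, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers


def residual_block(inputs, filters, kernel_size=3):
    """A minimal residual block: a convolutional path plus an identity shortcut."""
    # Path 1: further process the data with additional convolutional layers
    x = layers.Conv2D(filters, kernel_size, padding='same', use_bias=False)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Conv2D(filters, kernel_size, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)

    # Path 2: identity mapping, forwarding the input with no changes
    shortcut = inputs

    # Merge the two paths with an element-wise addition, then apply the activation
    outputs = layers.Activation('relu')(layers.add([x, shortcut]))
    return outputs
```

Because the second path leaves the input untouched, the convolutional path only has to learn the residual (the difference between the desired output and the input); if the identity is already the best mapping, its weights can simply converge toward zero.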