Researchers show that randomly initialized gradient descent can achieve zero training loss in deep learning

  • 2 min read
  • 13 Nov 2018


Yesterday, researchers from Carnegie Mellon University, the University of Southern California, Peking University, and the Massachusetts Institute of Technology published a paper on a central optimization problem in deep learning. The study proves that randomly initialized gradient descent can achieve zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet).

The key idea is to show that, as the network becomes more over-parameterized, the Gram matrix induced by the architecture becomes increasingly stable over the course of training, so every step of gradient descent decreases the loss at a geometric rate.
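
As a rough sketch of what a geometric (that is, linear) rate means here, in notation that is assumed rather than quoted from the paper: write u(k) for the network's predictions on the n training points after k steps of gradient descent, y for the training labels, η for the step size, and λ0 > 0 for the least eigenvalue of the limiting Gram matrix. The bounds in this line of work take roughly the form

```latex
\|\mathbf{y} - \mathbf{u}(k)\|_2^2 \;\le\; \left(1 - \frac{\eta \lambda_0}{2}\right)^{\!k} \|\mathbf{y} - \mathbf{u}(0)\|_2^2 ,
```

so the squared training error shrinks by a constant factor at every step and tends to zero.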

What is this study based on?


This study builds on two ideas from previous works on gradient descent for two-layer neural networks:

  • The first is an analysis of the dynamics of the network's predictions, whose convergence is determined by the least eigenvalue of the Gram matrix induced by the neural network architecture (a sketch follows this list). To lower bound this least eigenvalue, it suffices to bound the distance of each weight matrix from its initialization.
  • The second is an observation by Li and Liang: if the neural network is over-parameterized, every weight matrix stays close to its initialization throughout training.
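
A hedged sketch of the first idea, again in assumed notation: in the gradient flow (continuous-time) view, the vector of predictions u(t) on the n training points evolves according to a Gram matrix H(t) induced by the architecture, and the least eigenvalue of H(t) controls how quickly the residual shrinks:

```latex
\frac{d\mathbf{u}(t)}{dt} \;=\; \mathbf{H}(t)\,\bigl(\mathbf{y} - \mathbf{u}(t)\bigr),
\qquad
\frac{d}{dt}\,\|\mathbf{y} - \mathbf{u}(t)\|_2^2
  \;\le\; -2\,\lambda_{\min}\!\bigl(\mathbf{H}(t)\bigr)\,\|\mathbf{y} - \mathbf{u}(t)\|_2^2 .
```

If the weight matrices stay close to their initialization (the second point), H(t) stays close to its value at initialization, so its least eigenvalue stays bounded below by some λ0 > 0, which is what delivers the geometric contraction sketched earlier.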

What are the key observations made in this study?


This study focuses on the least squares loss and assumes the activation is Lipschitz and smooth. Suppose there are n data points and the neural network has H layers of width m.
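
For concreteness, with assumed notation, the objective being minimized is the least squares training loss

```latex
L(\mathbf{W}) \;=\; \frac{1}{2} \sum_{i=1}^{n} \bigl(f(\mathbf{W}, \mathbf{x}_i) - y_i\bigr)^2 ,
```

where f(W, x_i) is the network's output on the i-th training input and W collects the weights of all H layers, each of width m.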

The following are the main results the study proves (a toy numerical sketch follows the list):

  • Fully-connected feedforward network: If m = Ω(poly(n) · 2^O(H)), then randomly initialized gradient descent converges to zero training loss at a linear rate.
  • ResNet architecture: If m = Ω(poly(n, H)), then randomly initialized gradient descent converges to zero training loss at a linear rate. Compared with the first result, the dependence on the number of layers improves exponentially for ResNet, demonstrating the theoretical advantage of residual connections.
  • Convolutional ResNet: The same technique is used to analyze the convolutional ResNet. If m = poly(n, p, H) where p is the number of patches, then randomly initialized gradient descent achieves zero training loss.
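
To make the headline claim concrete, here is a small, purely illustrative NumPy sketch (not the paper's experiment; every numeric choice below is an assumption) of the phenomenon the theorems describe: a wide two-layer network with a smooth activation, trained by plain full-batch gradient descent from a random initialization, drives its training loss on arbitrary labels toward zero.

```python
# Toy illustration of over-parameterized gradient descent reaching ~zero training loss.
# Two-layer net f(x) = (1/sqrt(m)) * sum_r a_r * tanh(w_r . x); only W is trained and
# the output signs a_r are fixed, as in the two-layer analyses this study builds on.
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 10, 5, 4096                               # data points, input dim, hidden width (m >> n)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # unit-norm inputs
y = rng.standard_normal(n)                          # arbitrary (random) labels

W = rng.standard_normal((m, d))                     # randomly initialized first-layer weights
a = rng.choice([-1.0, 1.0], size=m)                 # fixed output weights

lr = 0.5
for step in range(3001):
    Z = np.tanh(X @ W.T)                            # hidden activations, shape (n, m)
    u = Z @ a / np.sqrt(m)                          # predictions on the n training points
    residual = u - y
    loss = 0.5 * np.sum(residual ** 2)              # least squares training loss
    if step % 500 == 0:
        print(f"step {step:4d}   training loss {loss:.3e}")
    # Full-batch gradient of the loss w.r.t. W (chain rule through tanh).
    S = (1.0 - Z ** 2) * a[None, :]                 # shape (n, m)
    grad = (residual[:, None] * S).T @ X / np.sqrt(m)
    W -= lr * grad
```

The width here is deliberately much larger than the number of data points; with these settings the printed loss should fall by many orders of magnitude, which is the behaviour the width conditions above are designed to guarantee.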


To learn more, you can read the full paper: Gradient Descent Finds Global Minima of Deep Neural Networks.

OpenAI launches Spinning Up, a learning resource for potential deep learning practitioners

Facebook open sources QNNPACK, a library for optimized mobile deep learning

Top 5 Deep Learning Architectures