Optimization algorithms
When discussing the back-propagation algorithm, we showed how the SGD strategy can be easily employed to train deep networks with large datasets. This method is quite robust and effective; however, the function to optimize is generally non-convex and the number of parameters is extremely large. These conditions dramatically increase the probability of finding saddle points (instead of local minima) and can slow down the training process when the surface is almost flat.
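As a point of reference, the following minimal sketch shows the vanilla SGD update rule, θ ← θ − η∇L(θ), and how an almost flat region produces a vanishingly small step (the function name and the array values are purely illustrative, not taken from the text):

```python
import numpy as np

# Vanilla SGD update: theta <- theta - eta * grad
# 'grad' stands for the mini-batch gradient computed by back-propagation.
def sgd_step(theta, grad, eta=0.01):
    return theta - eta * grad

# On an almost flat plateau the gradient norm is tiny, so the parameter
# update is correspondingly tiny and progress effectively stalls.
theta = np.array([0.5, -1.2])
flat_grad = np.array([1e-6, -2e-6])      # near-zero gradient on a plateau
step = sgd_step(theta, flat_grad, eta=0.01) - theta
print(np.linalg.norm(step))              # ~2e-8: negligible movement
```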
A common result of applying a vanilla SGD algorithm to these systems is shown in the following diagram:
Instead of reaching the optimal configuration, θ_opt, the algorithm reaches a sub-optimal parameter configuration, θ_subopt, and loses the ability to perform further corrections. To mitigate all these problems and their consequences, many SGD optimization algorithms have been proposed, with the purpose of speeding up the convergence (even when the gradients become extremely small) and avoiding the instabilities...
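As one common example of such an improvement (a sketch only, with illustrative hyperparameter values; the specific variants are discussed in the following sections), a classical momentum term accumulates a velocity from past gradients, so the update can keep moving even where the instantaneous gradient is almost zero:

```python
import numpy as np

# Classical momentum: the velocity v accumulates past gradients, so the
# update keeps progressing on plateaus and near saddle points, where the
# current gradient is almost zero. mu and eta are illustrative values.
def momentum_step(theta, grad, v, eta=0.01, mu=0.9):
    v = mu * v - eta * grad
    return theta + v, v

theta = np.array([0.5, -1.2])
v = np.array([0.05, -0.02])              # velocity accumulated earlier
flat_grad = np.array([1e-6, -2e-6])      # same near-zero gradient as above
theta_new, v = momentum_step(theta, flat_grad, v)
print(np.linalg.norm(theta_new - theta))  # ~0.048, driven by the velocity
```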