Now that we have an understanding of SGD and backpropagation, let's look at a number of advanced optimization methods that build on SGD and offer some advantage over it, usually faster training (that is, less time to minimize the cost function to the point where our network converges).
These improved methods incorporate a general notion of velocity as an optimization parameter. Quoting Wibisono and Wilson, from the opening of their paper on Accelerated Methods in Optimization:
"In convex optimization, there is an acceleration phenomenon in which we can boost the convergence rate of certain gradient-based algorithms."
In brief, these advanced algorithms all rely on a similar principle: carried by their momentum, essentially a moving average of past gradients, they can pass through local optima quickly instead of getting stuck in them.
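To make the idea concrete, here is a minimal sketch of the classical momentum update, written with NumPy. The function and variable names (`sgd_momentum_step`, `velocity`, `params`, `grads`) and the toy quadratic cost are illustrative assumptions, not a reference implementation of any particular library's optimizer.

```python
import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update (illustrative sketch).

    `velocity` is an exponentially decaying moving average of past
    gradients; it keeps the update moving through flat regions and
    shallow local optima instead of stopping at the first small gradient.
    """
    velocity = momentum * velocity - lr * grads  # accumulate velocity
    params = params + velocity                   # step along the velocity
    return params, velocity

# Toy usage on the quadratic cost f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(100):
    grad = w                         # gradient of the toy cost at w
    w, v = sgd_momentum_step(w, grad, v)
print(w)                             # approaches the minimum at [0, 0]
```

Compared with plain SGD, the only change is the `velocity` term: each step blends the current gradient with the accumulated direction of previous steps, which is the "acceleration" effect the quote above refers to.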