In this section, we will learn about two new variants of gradient descent, called momentum and Nesterov accelerated gradient.
Gradient descent with momentum
SGD and mini-batch gradient descent both suffer from oscillations in the parameter updates. Take a look at the following plot, which shows how mini-batch gradient descent converges. The oscillations are shown by the dotted line: the algorithm takes a gradient step in one direction, then in a different direction, and so on, until it reaches convergence:
This oscillation occurs because we update the parameters after...
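As a minimal numerical sketch of this behavior, the following code compares plain gradient descent with the standard momentum update rule (v = γv + η∇J(θ); θ = θ − v) on a toy elongated quadratic bowl. The loss function, learning rate, and momentum coefficient here are illustrative choices, not values from this chapter:

```python
import numpy as np

def loss(theta):
    # Elongated quadratic bowl: steep along theta[1], shallow along
    # theta[0] -- the classic setting where plain gradient descent
    # takes zig-zagging steps toward the minimum.
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)

def grad(theta):
    return np.array([theta[0], 10.0 * theta[1]])

def plain_gd(theta0, lr=0.02, steps=100):
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

def momentum_gd(theta0, lr=0.02, gamma=0.9, steps=100):
    theta = theta0.copy()
    v = np.zeros_like(theta)
    for _ in range(steps):
        # v accumulates an exponentially decaying average of past
        # gradients; gamma is the momentum coefficient.
        v = gamma * v + lr * grad(theta)
        theta -= v
    return theta

start = np.array([1.0, 1.0])
print("plain GD loss:   ", loss(plain_gd(start)))
print("momentum GD loss:", loss(momentum_gd(start)))
```

Because the velocity term averages recent gradients, the components that keep flipping sign partially cancel, while the consistent direction accumulates speed; after the same number of steps, the momentum run ends at a noticeably lower loss than plain gradient descent.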