Adaptive methods of gradient descent
In this section, we will learn about several adaptive versions of gradient descent.
Setting a learning rate adaptively using Adagrad
When we build a deep neural network, we have many parameters. Parameters are basically the weights of the network, so when we build a network with many layers, we will have many weights, say θ1 to θn. Our goal is to find the optimal values for all these weights. In all of the previous methods we learned about, the learning rate was a single value shared by all the parameters of the network. However, Adagrad (short for adaptive gradient) adaptively sets the learning rate for each parameter.
Parameters that have frequent updates or high gradients will have a slower learning rate, while parameters that have infrequent updates or small gradients will have a higher learning rate.
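To make this concrete, here is a minimal NumPy sketch of the Adagrad update rule (the function name adagrad_update and the toy loss are illustrative assumptions, not code from this book): each parameter accumulates the sum of its squared gradients, and its effective step size is the global learning rate divided by the square root of that accumulated sum.

import numpy as np

def adagrad_update(theta, grad, grad_accum, lr=0.01, epsilon=1e-8):
    # Accumulate the squared gradient for each parameter.
    grad_accum += grad ** 2
    # Parameters with a large accumulated gradient take a smaller step;
    # parameters with a small accumulated gradient take a larger step.
    theta -= lr * grad / (np.sqrt(grad_accum) + epsilon)
    return theta, grad_accum

# Toy example: minimize L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
accum = np.zeros_like(theta)
for _ in range(100):
    grad = theta.copy()            # gradient of the toy loss at the current theta
    theta, accum = adagrad_update(theta, grad, accum, lr=0.5)
print(theta)                       # approaches the minimum at [0, 0]

Note that because the accumulated sum of squared gradients only grows, the effective learning rate keeps shrinking as training proceeds.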