Updating Parameters
The goal of neural network training is to find the parameters that minimize the value of the loss function. This is the problem of finding the optimal parameters, a process called optimization. Unfortunately, optimization is difficult: the parameter space is vast and complicated, and you cannot obtain the minimum immediately by solving an equation. In a deep network, the problem is even harder because the number of parameters is huge.
So far, we have relied on the gradients (derivatives) of the loss with respect to the parameters to find the optimal parameters. By repeatedly updating the parameters a small step in the direction opposite the gradient, we approach the optimal parameters gradually. This simple method is called stochastic gradient descent (SGD), and it is already "smarter" than searching the parameter space randomly. However, SGD is a simple method, and for some problems there are methods smarter than SGD.
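To make the update rule concrete, here is a minimal sketch of SGD in Python, implementing W ← W − η ∂L/∂W, where η is the learning rate. It assumes the parameters and their gradients are stored in dictionaries keyed by name; the class name SGD, the update interface, and the default lr=0.01 are illustrative choices, not a fixed API:

```python
import numpy as np

class SGD:
    """Stochastic gradient descent: nudge each parameter a small
    step against its gradient to reduce the loss."""

    def __init__(self, lr=0.01):
        self.lr = lr  # learning rate (the step size eta)

    def update(self, params, grads):
        # params and grads are dicts with matching keys,
        # e.g. params['W1'] and grads['W1'] as NumPy arrays.
        for key in params:
            params[key] -= self.lr * grads[key]  # W <- W - lr * dL/dW

# Hypothetical usage: one update step on a single weight matrix.
params = {'W1': np.array([[0.5, -0.3]])}
grads = {'W1': np.array([[0.1, 0.2]])}
optimizer = SGD(lr=0.1)
optimizer.update(params, grads)
print(params['W1'])  # each weight has moved against its gradient
```

Keeping the optimizer in its own class with an update(params, grads) interface makes it easy to swap in an alternative update rule later without touching the training loop.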