SGD, in contrast to batch gradient descent, performs a parameter update for each training example x(i) and its label y(i):
Θ = Θ − η ∇Θ J(Θ; x(i), y(i))
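As a minimal sketch of this per-example update in NumPy (grad_fn is a hypothetical callable that returns ∇Θ J(Θ; x(i), y(i)) for a single example; it is not part of any specific library):

import numpy as np

def sgd_step(theta, grad_fn, x_i, y_i, lr=0.01):
    # Gradient of the loss J with respect to theta on one example (x_i, y_i).
    grad = grad_fn(theta, x_i, y_i)
    # Per-example update: theta = theta - eta * grad
    return theta - lr * grad

def sgd_epoch(theta, grad_fn, X, Y, lr=0.01):
    # One epoch: shuffle the data and update once per training example.
    for i in np.random.permutation(len(X)):
        theta = sgd_step(theta, grad_fn, X[i], Y[i], lr)
    return theta

Shuffling before each epoch avoids updating in a fixed, possibly biased order of examples.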
Adaptive Moment Estimation (Adam) computes adaptive learning rates for each parameter. Like Adadelta and RMSprop, Adam stores an exponentially decaying average of past squared gradients, but it additionally keeps an exponentially decaying average of past gradients, similar to momentum. Adam works well in practice and is one of the most widely used optimization methods today.
Adam stores the exponentially decaying average of past gradients (mt) in addition to the decaying average of past squared gradients (vt) used by Adadelta and RMSprop. Whereas momentum can be pictured as a ball running down a slope, Adam behaves like a heavy ball with friction and therefore prefers flat minima in the error surface. The decaying averages of past and past squared gradients, mt and vt, are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.
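As a sketch of a single Adam step with the usual default hyperparameters (β1 = 0.9, β2 = 0.999, ε = 1e-8); the function and variable names here are illustrative, not taken from a particular library:

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Decaying average of past gradients (first moment estimate, mt).
    m = beta1 * m + (1 - beta1) * grad
    # Decaying average of past squared gradients (second moment estimate, vt).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: m and v start at zero and are biased towards zero
    # during the first steps, so rescale them by 1 - beta^t.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter update scaled by the root of the second moment estimate.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

Calling this once per gradient evaluation, with m and v initialised to zero arrays of the same shape as theta and t counting steps from 1, applies the standard Adam update.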