Let's recap gradient descent by answering the following questions; the code sketches after the list can serve as a quick reference for the update rules:
- How does SGD differ from vanilla gradient descent?
- Explain mini-batch gradient descent.
- Why do we need momentum?
- What is the motivation behind NAG?
- How does Adagrad set the learning rate adaptively?
- What is the update rule of Adadelta?
- How does RMSProp overcome the limitations of Adagrad?
- Define the update equation of Adam.
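As a quick reference for the first few questions, here is a minimal sketch of the vanilla/SGD, momentum, and NAG update rules. It assumes a toy quadratic loss whose gradient is simply `theta`, and the hyperparameter values (`lr`, `gamma`) are illustrative choices, not prescribed ones:

```python
import numpy as np

# Hypothetical gradient of the toy loss L(theta) = 0.5 * ||theta||^2.
# Any differentiable loss with a gradient function would work here.
def grad(theta):
    return theta

theta = np.array([1.0, -2.0])   # parameters
lr, gamma = 0.1, 0.9            # learning rate and momentum coefficient (assumed values)

# Vanilla gradient descent and SGD share the same update rule; SGD just
# estimates the gradient from a single example (or a mini-batch) instead
# of the full dataset.
theta = theta - lr * grad(theta)

# Momentum: accumulate an exponentially decaying velocity of past gradients.
v = np.zeros_like(theta)
for _ in range(5):
    v = gamma * v + lr * grad(theta)
    theta = theta - v

# NAG: evaluate the gradient at the look-ahead position theta - gamma * v
# before taking the step.
v = np.zeros_like(theta)
for _ in range(5):
    v = gamma * v + lr * grad(theta - gamma * v)
    theta = theta - v
```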
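The adaptive methods from the remaining questions can be sketched the same way. Again, `grad`, `lr`, and the decay rates are illustrative assumptions; the loops only show the shape of each update, not a tuned implementation:

```python
import numpy as np

def grad(theta):
    # Hypothetical gradient of the toy loss L(theta) = 0.5 * ||theta||^2.
    return theta

lr, eps = 0.1, 1e-8

# Adagrad: divide the learning rate by the root of the accumulated squared
# gradients, so frequently updated parameters receive smaller steps.
theta, G = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(5):
    g = grad(theta)
    G += g ** 2
    theta -= lr * g / np.sqrt(G + eps)

# RMSProp: replace Adagrad's ever-growing sum with an exponential moving
# average, so the effective learning rate does not shrink toward zero.
theta, rho, Eg2 = np.array([1.0, -2.0]), 0.9, np.zeros(2)
for _ in range(5):
    g = grad(theta)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    theta -= lr * g / np.sqrt(Eg2 + eps)

# Adadelta: scale the step by a running average of past squared updates,
# which removes the global learning rate entirely.
theta, Eg2, Edx2 = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for _ in range(5):
    g = grad(theta)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    theta += dx

# Adam: bias-corrected first moment (mean) and second moment (uncentered
# variance) of the gradients.
theta, beta1, beta2 = np.array([1.0, -2.0]), 0.9, 0.999
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 6):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
```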