We started off this chapter by learning about what convex and non-convex functions are. Then, we explored how we can find the minimum of a function using gradient descent, which minimizes the loss by repeatedly updating the model parameters in the direction opposite to the gradient. Later, we looked at SGD, where we update the parameters of the model after each individual data point, and then we learned about mini-batch SGD, where we update the parameters after iterating through a batch of data points.
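To make this concrete, here is a minimal NumPy sketch of the mini-batch update rule, assuming a hypothetical `loss_grad(theta, x, y)` function that returns the gradient of the loss on the given batch; with `batch_size=1` it behaves like SGD, and with `batch_size=len(x)` it reduces to plain gradient descent:

```python
import numpy as np

def minibatch_sgd(x, y, theta, loss_grad, lr=0.01, batch_size=32, epochs=10):
    """Update theta after every mini-batch of data points."""
    n = len(x)
    for _ in range(epochs):
        # Visit the data in a different random order each epoch
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of the loss computed only on the current mini-batch
            grad = loss_grad(theta, x[batch], y[batch])
            # Step against the gradient, scaled by the learning rate
            theta = theta - lr * grad
    return theta
```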
Going forward, we learned how momentum reduces oscillations in the gradient steps and helps us attain convergence faster. Following this, we looked at Nesterov momentum, where, instead of calculating the gradient at the current position, we calculate it at the position the momentum will carry us to (the look-ahead point).
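The following sketch contrasts the two update rules, again assuming a hypothetical `grad_fn(theta)` that returns the gradient of the loss at `theta`; note how the Nesterov step evaluates the gradient at the look-ahead point rather than at the current position:

```python
def momentum_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """Classic momentum: accumulate a decaying velocity of past gradients."""
    v = gamma * v + lr * grad_fn(theta)      # the velocity term damps oscillations
    theta = theta - v
    return theta, v

def nesterov_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead position."""
    lookahead = theta - gamma * v            # where the current momentum will take us
    v = gamma * v + lr * grad_fn(lookahead)  # gradient computed at the look-ahead point
    theta = theta - v
    return theta, v
```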
We also learned about the Adagrad method, where we adaptively set the learning rate for each parameter: parameters with a large history of past gradients take smaller steps, while parameters with a small history take larger ones.
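A minimal sketch of one Adagrad update, under the same hypothetical `grad_fn(theta)` assumption, could look like this; the accumulated squared gradients shrink the effective learning rate for parameters that have seen large gradients:

```python
import numpy as np

def adagrad_step(theta, grad_acc, grad_fn, lr=0.01, epsilon=1e-8):
    """Adagrad: give each parameter its own learning rate based on its gradient history."""
    g = grad_fn(theta)
    grad_acc = grad_acc + g ** 2                             # running sum of squared gradients
    theta = theta - (lr / np.sqrt(grad_acc + epsilon)) * g   # larger history -> smaller step
    return theta, grad_acc
```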