Automatic differentiation for loss minimization
Recall from our previous discussion that to fit a predictive model to a training dataset, we first choose an appropriate loss function, derive the gradient of this loss with respect to the model's parameters, and then adjust the parameters in the direction opposite the gradient to achieve a lower loss. This procedure is possible only if we have access to the derivative of the loss function.
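To make this procedure concrete, here is a minimal sketch in Python with NumPy (the toy data and learning rate are illustrative assumptions, not from the original text). It fits a linear model by minimizing the mean squared error, with the gradient derived by hand, which is exactly the approach described next.

import numpy as np

# Illustrative toy data: y is roughly 3*x + 2 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 2.0 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0  # parameters of the model y_hat = w*x + b
lr = 0.1         # learning rate (assumed step size)

for step in range(200):
    y_hat = w * x + b
    # Mean squared error loss: L = mean((y_hat - y)^2)
    loss = np.mean((y_hat - y) ** 2)
    # Hand-derived gradients of L:
    #   dL/dw = mean(2 * (y_hat - y) * x)
    #   dL/db = mean(2 * (y_hat - y))
    grad_w = np.mean(2.0 * (y_hat - y) * x)
    grad_b = np.mean(2.0 * (y_hat - y))
    # Step each parameter opposite its gradient to lower the loss.
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.2f}, b={b:.2f}, loss={loss:.4f}")

Note that the two gradient lines had to be worked out with calculus before any of this code could be written; swapping in a different loss function would mean redoing that derivation by hand.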
Earlier machine learning systems worked this way because researchers derived the derivatives of common loss functions by hand using calculus and hardcoded them into the training algorithm so that the loss could be minimized. Unfortunately, deriving a function's gradient by hand can be difficult, especially when the loss function is not well behaved. In the past, you would have to substitute a different, more mathematically convenient loss function just to make your model trainable, even if the new function was less appropriate, potentially sacrificing predictive performance.