Defining loss and cost functions
Many machine learning problems can be expressed through the minimization of a proxy function that measures the training error. The implicit assumption is that, by reducing both the training and the validation error, the accuracy increases and the algorithm reaches its objective.
If we consider a supervised scenario (many of these considerations also hold for semi-supervised ones), with finite datasets X and Y:

$X = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_N\}$ where $\bar{x}_i \in \mathbb{R}^n$

$Y = \{\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_N\}$ where $\bar{y}_i$ is the label associated with $\bar{x}_i$
We can define the generic loss function for a single data point as:

$J(\bar{x}_i, \bar{y}_i; \bar{\theta})$

J is a function of the whole parameter set $\bar{\theta}$ and must be proportional to the error between the true label and the predicted one.
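As a minimal sketch of this idea (assuming a linear model and the squared-error loss, which are illustrative choices rather than ones prescribed here), a per-sample loss J can be written as a non-negative function of the parameters:

```python
import numpy as np

def predict(x_i, theta):
    # Hypothetical parametric model: a simple linear predictor
    return np.dot(theta, x_i)

def loss(x_i, y_i, theta):
    # Per-sample squared-error loss J(x_i, y_i; theta): always >= 0
    # and growing with the discrepancy between prediction and true label
    return 0.5 * (predict(x_i, theta) - y_i) ** 2
```

Any other per-sample loss (for example, cross-entropy for classification) fits the same scheme: it takes a data point, its label, and the whole parameter set, and returns a non-negative error measure.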
A very important property of a loss function is convexity. In many real cases, convexity is almost impossible to guarantee; however, it's always useful to look for convex loss functions, because they can be easily optimized through the gradient descent method. We're going to discuss this topic in Chapter 10, Introduction...
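To make the link between convexity and gradient descent concrete, the following sketch minimizes the average squared-error cost of a linear model, which is a convex function of the parameters; the learning rate eta, the number of epochs, and the synthetic data are illustrative assumptions:

```python
import numpy as np

def gradient_descent(X, Y, theta0, eta=0.01, n_epochs=100):
    # Minimize the average squared-error cost over the whole dataset.
    # For a linear model this cost is convex in theta, so gradient
    # descent converges to the global minimum for a suitable eta.
    theta = theta0.copy()
    for _ in range(n_epochs):
        # Gradient of 0.5 * mean((X @ theta - Y) ** 2) with respect to theta
        grad = X.T @ (X @ theta - Y) / len(Y)
        theta -= eta * grad
    return theta

# Hypothetical usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
Y = X @ true_theta + rng.normal(scale=0.1, size=100)
theta_hat = gradient_descent(X, Y, theta0=np.zeros(3), eta=0.1, n_epochs=500)
```

With a non-convex loss, the same procedure would only be guaranteed to reach a stationary point, which is why convexity is such a desirable property.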