GenAI model evaluation
Evaluating traditional AI models usually means measuring the model's outputs against a ground-truth dataset and calculating established, objective metrics: accuracy, precision, recall, F1 score, Mean Squared Error (MSE), and the others we've covered in this book. Evaluating generative models is not always as straightforward.
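To make the contrast concrete, here is a minimal sketch of what that ground-truth comparison looks like in practice. The use of scikit-learn and the toy labels below are my own illustrative choices, not a prescribed setup:

```python
# Scoring a traditional model against ground truth: every metric here is a
# direct, objective comparison between predictions and known correct answers.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_squared_error)

# Classification: made-up ground-truth labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
print(f"F1 score:  {f1_score(y_true, y_pred):.2f}")

# Regression: MSE compares predicted values to known targets the same way.
y_true_reg = [2.5, 0.0, 2.1, 7.8]
y_pred_reg = [3.0, -0.1, 2.0, 8.0]
print(f"MSE:       {mean_squared_error(y_true_reg, y_pred_reg):.2f}")
```

Each of these numbers has a single, unambiguous answer once the ground truth is fixed, which is precisely the property that generative outputs often lack.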
When evaluating a GenAI model, we might want to focus on different factors, such as the model's ability to produce creative, human-like outputs while staying relevant to the task at hand. An extra challenge is that these properties are inherently vague and subjective. For example, if I ask a generative model to write a poem or create a photorealistic picture of a cat in a meadow, the quality of that poem or the realism of the picture is not easy to capture in a mathematically calculated number, although some formulaic measurements do exist, which I will describe in...
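As a small preview of such formulaic measurements, BLEU is one common example for generated text: it scores n-gram overlap between a generated sentence and one or more reference sentences. The choice of NLTK and the toy sentences below are mine, purely for illustration:

```python
# A formulaic (but imperfect) proxy for generated-text quality: BLEU measures
# n-gram overlap with reference text, not creativity or fluency.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Made-up reference and candidate sentences, tokenized into words.
reference = [["a", "cat", "sits", "in", "a", "sunny", "meadow"]]
candidate = ["a", "cat", "is", "sitting", "in", "a", "meadow"]

# Smoothing avoids a zero score when some higher-order n-grams never match.
smoothie = SmoothingFunction().method1
print(f"BLEU: {sentence_bleu(reference, candidate, smoothing_function=smoothie):.2f}")
```

Note that a low BLEU score here would not necessarily mean the candidate sentence is bad, only that it overlaps little with the chosen references; that gap between the formula and perceived quality is exactly the evaluation challenge this section is about.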