Evaluation metrics
To perform evaluations on your AI system, you must combine your evaluation data with an evaluation metric. An evaluation metric takes the input and the output of an AI system and returns a score measuring how the AI system performed on that evaluation case.
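As a minimal sketch of the idea, a metric can be modeled as a function from an (input, output) pair to a score. The `Metric` alias, the function names, and the scoring rules below are hypothetical illustrations, not part of any particular evaluation library.

```python
from typing import Callable

# Assumed shape of a metric: (system input, system output) -> score.
Metric = Callable[[str, str], float]


def is_nonempty(case_input: str, output: str) -> float:
    """Hypothetical metric that returns only 0.0 or 1.0.

    Scores 1.0 if the system produced any non-whitespace output, else 0.0.
    """
    return 1.0 if output.strip() else 0.0


def input_term_coverage(case_input: str, output: str) -> float:
    """Hypothetical metric that returns a score in [0.0, 1.0].

    Scores the fraction of distinct input words that also appear in the output.
    """
    input_terms = set(case_input.lower().split())
    if not input_terms:
        return 0.0
    output_terms = set(output.lower().split())
    return len(input_terms & output_terms) / len(input_terms)


# Both functions satisfy the same assumed interface.
metrics: dict[str, Metric] = {
    "is_nonempty": is_nonempty,
    "input_term_coverage": input_term_coverage,
}
```

Keeping every metric behind the same signature means an evaluation loop can apply any of them to a case without special handling.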
Evaluation metrics typically return scores between 0 and 1. A metric that returns only 0 or 1 is called a binary metric. A metric that returns any score between 0 and 1, inclusive, is called a normalized metric. Binary metrics clearly determine whether a case passes or fails: 0 is a fail and 1 is a pass. Normalized metrics present a more nuanced view of how the AI system performs, but that nuance can be harder to interpret. To add clarity to normalized metrics, you can set a minimum threshold score that the metric must return for the case to be considered a pass. For example, say the metric Foo returns a score of 0.6 for one evaluation case and 0.7 for another. If you have a threshold of 0.65, then the 0.6 score is considered...
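The threshold rule described above can be sketched in a few lines. This assumes a score passes when it is greater than or equal to the minimum threshold. The `passes` helper and the case names are hypothetical; the scores and the threshold are the ones from the Foo example.

```python
def passes(score: float, threshold: float) -> bool:
    """Return True if a normalized score meets the minimum threshold.

    Assumption: a score exactly equal to the threshold counts as a pass.
    """
    return score >= threshold


# Scores returned by the metric Foo for two evaluation cases (from the example).
scores = {"case_1": 0.6, "case_2": 0.7}
THRESHOLD = 0.65

# Convert each normalized score into a pass/fail result.
results = {name: passes(score, THRESHOLD) for name, score in scores.items()}
```

Under this reading, the threshold turns a normalized metric into a pass/fail decision while the raw score stays available for closer inspection.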