Evaluation paradigms
In this section, we will review some of the major evaluation paradigms used to quantify system performance and to compare systems.
Comparing system results on standard metrics
This is the most common evaluation paradigm and probably the easiest to carry out. The system is simply given data to process, and its performance is evaluated quantitatively based on standard metrics. The upcoming Evaluation metrics section will delve into this topic in much greater detail.
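As a minimal sketch of this paradigm, the following example scores two hypothetical classifiers against the same gold labels using accuracy, precision, recall, and F1. The labels, the system outputs, and the `evaluate` helper are all invented for illustration; a real evaluation would normally use a held-out test set and an established metrics library.

```python
def evaluate(gold, predicted, positive_label):
    """Return accuracy, precision, recall, and F1 for one label of interest."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    correct = sum(1 for g, p in pairs if g == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical gold labels and outputs from two systems on the same data.
gold = ["positive", "negative", "positive", "negative", "positive"]
system_a = ["positive", "negative", "negative", "negative", "positive"]
system_b = ["positive", "positive", "positive", "positive", "positive"]

scores_a = evaluate(gold, system_a, positive_label="positive")
scores_b = evaluate(gold, system_b, positive_label="positive")

for name, scores in [("System A", scores_a), ("System B", scores_b)]:
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

On this toy data, System A is more accurate overall, while System B, which always predicts `positive`, has perfect recall but lower precision. Comparing systems on several metrics at once makes this kind of trade-off visible.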
Evaluating language output
Some NLU applications produce natural language output. These include applications such as translation and text summarization. They differ from applications with a specific right or wrong answer, such as classification and slot filling, because there is no single correct answer – there could be many good answers.
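To illustrate why a single right answer is not enough here, the sketch below compares one hypothetical system output with two equally acceptable reference sentences. The sentences and both helper functions are assumptions made for this example, and the word-overlap score is a deliberately crude stand-in, not a standard metric.

```python
def exact_match(output, reference):
    """True only if the output is identical to the reference (ignoring case)."""
    return output.strip().lower() == reference.strip().lower()


def token_overlap(output, reference):
    """Fraction of output words that also occur in the reference (crude)."""
    output_tokens = output.lower().split()
    reference_tokens = set(reference.lower().split())
    if not output_tokens:
        return 0.0
    shared = sum(1 for token in output_tokens if token in reference_tokens)
    return shared / len(output_tokens)


# Two hypothetical references that are both good answers for the same input.
references = ["the cat is on the mat", "there is a cat on the mat"]
system_output = "a cat is on the mat"

# Exact matching treats the output as wrong against every reference.
any_exact = any(exact_match(system_output, ref) for ref in references)

# Word overlap gives graded credit, and the score depends on the reference.
overlap_scores = [token_overlap(system_output, ref) for ref in references]
best_overlap = max(overlap_scores)

print("Exact match with any reference:", any_exact)
print("Overlap with each reference:", [round(s, 3) for s in overlap_scores])
print("Best overlap:", round(best_overlap, 3))
```

The output is a reasonable sentence, yet it matches neither reference exactly, and its overlap score changes depending on which reference is used. This is why evaluating language output calls for different methods than checking against a single correct label.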
One way to evaluate machine translation quality is for humans to look at the original text and the translation and judge how accurate it is...