Evaluation
Evaluating transformers involves considering multiple classes of metrics and understanding the cost trade-offs among these classes. Let's look at the main ones.
Quality
The quality of transformers can be measured against a number of publicly available datasets. Let's review the most commonly used ones.
GLUE
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. GLUE is available at https://gluebenchmark.com/.
GLUE consists of:
- A benchmark of nine sentence or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty
- A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language
- A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set
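As a minimal sketch of how a GLUE task can be evaluated in practice, the following example assumes the Hugging Face `datasets` and `evaluate` libraries are installed; it loads one of the nine GLUE tasks (MRPC) and scores placeholder predictions against the task's official metric. In a real evaluation, the predictions would come from the model being assessed.

```python
# Minimal sketch: load a GLUE task and compute its metric using the
# Hugging Face `datasets` and `evaluate` libraries (assumed installed).
from datasets import load_dataset
import evaluate

# Load the validation split of MRPC, one of GLUE's sentence-pair tasks.
dataset = load_dataset("glue", "mrpc", split="validation")

# Load the metric associated with this GLUE task (accuracy and F1 for MRPC).
metric = evaluate.load("glue", "mrpc")

# Placeholder predictions: replace with the model's predicted labels.
predictions = [0] * len(dataset)

results = metric.compute(predictions=predictions, references=dataset["label"])
print(results)  # e.g. {'accuracy': ..., 'f1': ...}
```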