ROUGE metric evaluation
A summary generated by a model should be readable, coherent, grammatically correct, and factually accurate. Human evaluation of summaries is a mammoth task: if a person took 30 seconds to evaluate one summary in the Gigaword dataset, it would take over 26 hours for one person to check the validation set. Moreover, because abstractive summaries are generated anew each run, this human evaluation would need to be repeated every time summaries are produced. The ROUGE metric tries to measure various aspects of an abstractive summary automatically. It is a collection of four metrics:
- ROUGE-N is the n-gram recall between a generated summary and the ground truth or reference summary. The "N" in the name specifies the length of the n-gram; it is common to report ROUGE-1 and ROUGE-2. The metric is calculated as the number of n-grams that match between the reference summary and the generated summary, divided by the total number of n-grams in the reference summary.
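The ROUGE-N recall described above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation; the function names are hypothetical, and it uses clipped counts (a shared n-gram is counted at most as often as it appears in the reference):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(generated, reference, n=1):
    """ROUGE-N recall: matching n-grams / total n-grams in the reference."""
    gen_counts = ngrams(generated.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Clip each match to the reference count so repeats aren't over-credited
    overlap = sum(min(count, gen_counts[gram])
                  for gram, count in ref_counts.items())
    return overlap / total
```

For example, comparing the generated summary "the cat sat on the mat" against the reference "the cat is on the mat" gives a ROUGE-1 recall of 5/6 (five of the six reference unigrams appear in the output) and a ROUGE-2 recall of 3/5. Production systems typically rely on an established package such as `rouge-score` rather than hand-rolled counts.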