To evaluate a text summarization task, we use a popular set of metrics called ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation. First, we will understand how the ROUGE metric works, and then we will check the ROUGE score for text summarization with the BERTSUM model.
The ROUGE metric was first introduced in the paper ROUGE: A Package for Automatic Evaluation of Summaries by Chin-Yew Lin. The five different ROUGE evaluation metrics are the following:
- ROUGE-N
- ROUGE-L
- ROUGE-W
- ROUGE-S
- ROUGE-SU
We will focus only on ROUGE-N and ROUGE-L. First, let's understand how ROUGE-N is computed, and then we will look at ROUGE-L.
Understanding the ROUGE-N metric
ROUGE-N is an n-gram recall between a candidate summary (predicted summary) and a reference summary (actual summary).
The recall is defined as the ratio of the number of overlapping n-grams between the candidate and reference summaries to the total number of n-grams in the reference summary:

Recall = (number of overlapping n-grams) / (total number of n-grams in the reference summary)
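This recall computation can be sketched in a few lines of Python. The helper names and example sentences below are illustrative, not from the chapter; overlap counts are clipped (an n-gram shared twice in the reference only counts twice), which matches how ROUGE counts co-occurrences:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: overlapping n-grams / total n-grams in the reference."""
    cand_counts = ngrams(candidate.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    overlap = sum((cand_counts & ref_counts).values())  # clipped overlap counts
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

candidate = "the cat was found under the bed"
reference = "the cat was under the bed"
print(rouge_n_recall(candidate, reference, n=1))  # 1.0: every reference unigram appears
print(rouge_n_recall(candidate, reference, n=2))  # 0.8: 4 of 5 reference bigrams appear
```

Note that with unigrams (n=1) the candidate covers every word of the reference, so recall is perfect, while the inserted word "found" breaks one reference bigram, lowering ROUGE-2 recall to 0.8.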