Evaluating summaries
When people write summaries, they use inventive language: human-written summaries often contain words that do not appear in the vocabulary of the text being summarized. Likewise, when models generate abstractive summaries, they may use words that differ from those in the ground truth summaries provided. There is no straightforward way to do an effective semantic comparison of the ground truth summary and the generated summary. Summarization problems therefore often involve a human evaluation step, in which a qualitative check of the generated summaries is performed, but this method is both unscalable and expensive. There are approximations that use n-gram overlaps and longest common subsequence matches after stemming and lemmatization. The hope is that such preprocessing brings the ground truth and generated summaries closer together for evaluation. The most common metric used for evaluating summaries is Recall-Oriented Understudy for Gisting Evaluation, also known as ROUGE.
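To make the n-gram overlap idea concrete, here is a minimal sketch of a ROUGE-N-style recall computation written from scratch. It is an illustrative approximation, not the official ROUGE implementation: it tokenizes by simple whitespace splitting and lowercasing, and the example reference and candidate sentences are made up for demonstration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, candidate, n=1):
    """ROUGE-N recall: fraction of reference n-grams also found in the candidate.

    Overlap counts are clipped, so an n-gram in the candidate can only be
    credited as many times as it appears in the reference.
    """
    ref_ngrams = ngrams(reference.lower().split(), n)
    cand_ngrams = ngrams(candidate.lower().split(), n)
    overlap = sum((ref_ngrams & cand_ngrams).values())  # clipped overlap counts
    total = sum(ref_ngrams.values())
    return overlap / total if total else 0.0

# Hypothetical ground truth and generated summary for illustration
reference = "the cat sat on the mat"
candidate = "a cat was sitting on the mat"

print(rouge_n_recall(reference, candidate, n=1))  # unigram (ROUGE-1) recall
print(rouge_n_recall(reference, candidate, n=2))  # bigram (ROUGE-2) recall
```

Note that this sketch skips the stemming and lemmatization step mentioned above; in practice those would be applied to both texts before counting n-grams so that surface variants such as "sitting" and "sat" have a better chance of matching.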