Evaluating machine translation with BLEU
Papineni et al. (2002) designed an efficient method to evaluate machine translation. A human baseline was difficult to define. However, they realized that comparing a machine translation, word by word, to one or more human reference translations could produce effective results.
Papineni et al. (2002) named their method the Bilingual Evaluation Understudy (BLEU) score.
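As a rough illustration of this word-by-word comparison (a simplified sketch, not BLEU itself, which also clips repeated words and combines several n-gram orders), a unigram precision can be computed by counting how many candidate words appear in the reference:
# Simplified unigram precision: the fraction of candidate words
# that also appear in the reference translation (no count clipping).
reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

matches = sum(1 for word in candidate if word in reference)
print(matches / len(candidate))  # 5 of the 6 candidate words appear in the reference
BLEU refines this idea by clipping each word's count to its maximum count in any reference, so a candidate cannot be rewarded for repeating the same matching word.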
In this section, we will use the Natural Language Toolkit (NLTK) to implement BLEU:
http://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.sentence_bleu
We will begin with geometric evaluations.
Geometric evaluations
The BLEU method compares the parts (n-grams) of a candidate sentence to one or more reference sentences.
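Concretely, BLEU computes a modified (clipped) precision $p_n$ for each n-gram order, combines the precisions as a geometric mean, and multiplies the result by a brevity penalty $BP$ that penalizes candidates shorter than the reference. In the standard formulation:

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

where $c$ is the candidate length, $r$ is the effective reference length, and the weights are typically uniform, $w_n = 1/N$ with $N = 4$. Because of the geometric mean, a zero precision at any n-gram order drives the whole score toward zero.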
Open BLEU.py, which is in the chapter directory of the GitHub repository of this book. The program imports the nltk module:
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction  # assumed completion of the truncated import; used to smooth zero n-gram counts
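As a minimal usage sketch (the sentences and variable names here are illustrative, not taken from BLEU.py), sentence_bleu() takes a list of tokenized reference sentences and a tokenized candidate:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Each reference is a tokenized sentence; several references can be supplied.
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# The default weights (0.25, 0.25, 0.25, 0.25) produce BLEU-4, the geometric
# mean of the 1-gram to 4-gram modified precisions.
print(sentence_bleu(reference, candidate))

# No 4-gram of the candidate appears in the reference, so the unsmoothed
# score collapses toward zero; Chen-Cherry smoothing keeps it informative.
smoothie = SmoothingFunction().method1
print(sentence_bleu(reference, candidate, smoothing_function=smoothie))
For evaluating a whole test set rather than a single sentence, NLTK also provides corpus_bleu(), which accumulates the n-gram counts over all sentence pairs before taking the geometric mean.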