Evaluating machine translation with BLEU
Papineni et al. (2002) came up with an efficient way to evaluate machine translation. A human baseline was difficult to define. However, they realized that reliable results could be obtained by comparing a machine translation with one or more human reference translations, word for word.
Papineni et al. (2002) named their method the Bilingual Evaluation Understudy (BLEU) score.
In this section, we will use the Natural Language Toolkit (NLTK) to implement BLEU:
http://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.sentence_bleu
We will begin with geometric evaluations.
Geometric evaluations
The BLEU method compares the parts of a candidate sentence, n-gram by n-gram, to a reference sentence or to several reference sentences.
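For reference, the BLEU score defined by Papineni et al. (2002) is the geometric mean of the modified n-gram precisions of the candidate, multiplied by a brevity penalty that penalizes candidates shorter than the references:

$$\mathrm{BLEU} = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

Here, $p_n$ is the modified precision for n-grams of size $n$, $w_n$ are the weights (typically $w_n = 1/N$ with $N = 4$), $c$ is the candidate length, and $r$ is the effective reference length. NLTK's sentence_bleu computes this score with default weights of (0.25, 0.25, 0.25, 0.25).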
Open BLEU.py, which is in the chapter directory of the GitHub repository of this book. The program imports the nltk module:
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
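As a minimal sketch of how sentence_bleu is used (the sentences below are hypothetical examples chosen for illustration, not necessarily the ones in BLEU.py), the function takes a list of tokenized reference sentences and a tokenized candidate sentence:

from nltk.translate.bleu_score import sentence_bleu

# Hypothetical tokenized reference translations and a candidate translation
references = [['the', 'cat', 'is', 'on', 'the', 'mat'],
              ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']]
candidate = ['a', 'cat', 'is', 'on', 'the', 'mat']

# The default weights (0.25, 0.25, 0.25, 0.25) take the geometric mean
# of the modified 1-gram to 4-gram precisions of the candidate
score = sentence_bleu(references, candidate)
print(score)  # below 1.0 because some 3-grams and 4-grams have no match

# A candidate that matches one of the references exactly scores 1.0
print(sentence_bleu(references, references[0]))

The candidate is compared against all the references at once: an n-gram counts as correct if it appears in any reference, which is why providing several reference translations generally produces a fairer score.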