Preprocessing a WMT dataset
Vaswani et al. (2017) report the Transformer's results on the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, where it achieved state-of-the-art BLEU scores at the time of publication. BLEU is described in the Evaluating machine translation with BLEU section of this chapter.
The 2014 Workshop on Machine Translation (WMT) contained several European language datasets. One of the datasets contained data taken from version 7 of the Europarl corpus. We will be using the French-English dataset from the European Parliament Proceedings Parallel Corpus 1996-2011. The link is https://www.statmt.org/europarl/v7/fr-en.tgz.
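If you prefer to fetch the archive from a script rather than a browser, the following is a minimal sketch using only the Python standard library. The helper names download and extract and the data/ directory in the usage comments are our own assumptions, not part of the chapter's code; only the URL comes from the text above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL given in the text above.
EUROPARL_URL = "https://www.statmt.org/europarl/v7/fr-en.tgz"


def download(url, archive_path):
    """Download url to archive_path, skipping the download if the file exists."""
    archive_path = Path(archive_path)
    if not archive_path.exists():
        archive_path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def extract(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive and return its member names."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names


# Example usage (hypothetical local paths; the download is large):
# archive = download(EUROPARL_URL, "data/fr-en.tgz")
# print(extract(archive, "data"))
```

The download is skipped when the archive is already on disk, so the script can be rerun without fetching the file again.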
Once you have downloaded and extracted the archive, we will preprocess the two parallel files:
europarl-v7.fr-en.en
europarl-v7.fr-en.fr
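The two files are parallel: line n of the English file is expected to correspond to line n of the French file. Before any preprocessing, it can be worth checking that assumption. The sketch below is one possible way to do so; the function name load_parallel and the UTF-8 encoding are our assumptions.

```python
def load_parallel(en_path, fr_path, encoding="utf-8"):
    """Read two line-aligned files and return a list of (English, French) pairs."""
    with open(en_path, encoding=encoding) as en_file, \
         open(fr_path, encoding=encoding) as fr_file:
        # Iterate over the file objects so that only real line breaks split lines.
        en_lines = [line.rstrip("\n") for line in en_file]
        fr_lines = [line.rstrip("\n") for line in fr_file]
    if len(en_lines) != len(fr_lines):
        raise ValueError(
            f"Files are not aligned: {len(en_lines)} English lines, "
            f"{len(fr_lines)} French lines"
        )
    return list(zip(en_lines, fr_lines))


# Example usage (assumes the files were extracted to the working directory):
# pairs = load_parallel("europarl-v7.fr-en.en", "europarl-v7.fr-en.fr")
# print(len(pairs), pairs[0])
```

Empty lines are kept at this stage so that the line indices of the two files stay in step.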
We will load, clean, and reduce the size of the corpus.
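To give a rough idea of what cleaning and reducing can mean in practice, here is a minimal sketch. It is not necessarily the procedure used in the rest of this chapter: the specific cleaning rules (lowercasing, stripping accents and ASCII punctuation), the function names, and the max_pairs and max_tokens parameters are all assumptions chosen for illustration.

```python
import string
import unicodedata

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_line(line):
    """Lowercase a sentence and strip accents, ASCII punctuation, and extra spaces."""
    # Decompose accented characters, then drop the combining marks.
    # Assumption: losing accents is acceptable for this illustration.
    line = unicodedata.normalize("NFD", line)
    line = "".join(ch for ch in line if not unicodedata.combining(ch))
    line = line.lower().translate(_PUNCT_TABLE)
    return " ".join(line.split())


def clean_and_reduce(pairs, max_pairs=None, max_tokens=50):
    """Clean (English, French) pairs and keep a smaller, well-formed subset."""
    kept = []
    for en, fr in pairs:
        en, fr = clean_line(en), clean_line(fr)
        # Drop pairs in which either side is empty after cleaning.
        if not en or not fr:
            continue
        # Drop pairs in which either side is longer than max_tokens words.
        if len(en.split()) > max_tokens or len(fr.split()) > max_tokens:
            continue
        kept.append((en, fr))
        if max_pairs is not None and len(kept) >= max_pairs:
            break
    return kept


sample = [("Resumption of the session!", "Reprise de la session !"), ("", "Vide")]
print(clean_and_reduce(sample))
```

Here the corpus is reduced by dropping empty or very long pairs and by capping the number of pairs kept. Other strategies, such as limiting the vocabulary, are equally possible.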
Let's start the preprocessing.