Seq2seq for translation
Sequence-to-sequence (Seq2seq) networks found their first application in language translation.
A translation task was designed for the conferences of the Association for Computational Linguistics (ACL), based on the WMT16 dataset, which is composed of news translations in different languages. The purpose of this dataset is to evaluate new translation systems and techniques. We'll use the German-English pair.
First, preprocess the data:
python 0-preprocess_translations.py --srcfile data/src-train.txt --targetfile data/targ-train.txt --srcvalfile data/src-val.txt --targetvalfile data/targ-val.txt --outputfile data/demo

First pass through data to get vocab...
Number of sentences in training: 10000
Number of sentences in valid: 2819
Source vocab size: Original = 24995, Pruned = 24999
Target vocab size: Original = 35816, Pruned = 35820
(2819, 2819)
Saved 2819 sentences (dropped 181 due to length/unk filter)
(10000, 10000)
Saved 10000 sentences (dropped 0 due to length/unk filter...
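To make clear what this preprocessing amounts to, here is a minimal sketch, not the actual 0-preprocess_translations.py script: it builds a pruned vocabulary from a training file and converts each sentence into a list of word indices, dropping sentences that exceed a length limit. The special tokens, the maximum length, and the vocabulary size are assumptions chosen for illustration.

import numpy as np
from collections import Counter

PAD, UNK, BOS, EOS = 0, 1, 2, 3   # assumed convention for special token indices
MAX_LEN = 50                       # assumed length filter
VOCAB_SIZE = 25000                 # keep only the most frequent words

def build_vocab(filename, max_size=VOCAB_SIZE):
    # Count word frequencies over the whole corpus
    counts = Counter()
    with open(filename, encoding='utf-8') as f:
        for line in f:
            counts.update(line.split())
    # Keep the most frequent words; the first indices are reserved for special tokens
    vocab = {'<pad>': PAD, '<unk>': UNK, '<s>': BOS, '</s>': EOS}
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def convert(filename, vocab, max_len=MAX_LEN):
    # Turn each sentence into a list of indices, dropping over-long sentences
    sentences, dropped = [], 0
    with open(filename, encoding='utf-8') as f:
        for line in f:
            words = line.split()
            if len(words) > max_len:
                dropped += 1
                continue
            sentences.append([BOS] + [vocab.get(w, UNK) for w in words] + [EOS])
    print('Saved %d sentences (dropped %d due to length filter)'
          % (len(sentences), dropped))
    return sentences

src_vocab = build_vocab('data/src-train.txt')
src_train = convert('data/src-train.txt', src_vocab)
np.save('data/demo.src.npy', np.array(src_train, dtype=object))

The real script applies the same idea to both the source and target sides, and to the validation files, so that training and validation sentences end up as arrays of indices sharing the vocabularies saved under the data/demo prefix.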