Preparing the Data for the Two Languages
In Chapter 7, Implementing NLP Applications, we discussed the advantages and disadvantages of training neural networks at the character level and at the word level. As we already have some experience with the character level, we decided to train this automatic translation network at the character level as well.
To train a neural machine translation network, we need a dataset of bilingual sentence pairs for the two languages. Datasets for many language combinations can be downloaded for free from www.manythings.org/anki/. From there, we can download a dataset of English and German sentences that are commonly used in everyday life. The dataset consists of only two columns: the original short text in English and the corresponding translation in German.
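For readers who want to inspect such a dataset outside the workflow, the two-column structure and the character-level view can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: it assumes the downloaded file is tab-separated with English in the first column and German in the second, the three sample rows are illustrative stand-ins rather than an excerpt of the real file, and the helper names read_pairs and char_vocabulary are hypothetical.

```python
import io

# Illustrative rows only (not an excerpt of the real file). The real
# download is assumed to be tab-separated: English, then German.
SAMPLE = "Hi.\tHallo!\nRun!\tLauf!\nI see.\tIch verstehe.\n"


def read_pairs(handle):
    """Return a list of (english, german) tuples from a tab-separated stream."""
    pairs = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Any columns beyond the first two are ignored.
        pairs.append((fields[0], fields[1]))
    return pairs


def char_vocabulary(texts):
    """Map each distinct character to an integer index, in sorted order."""
    chars = sorted(set("".join(texts)))
    return {ch: index for index, ch in enumerate(chars)}


pairs = read_pairs(io.StringIO(SAMPLE))

# One vocabulary per language, since the network reads English characters
# and produces German characters.
english_vocab = char_vocabulary(en for en, _ in pairs)
german_vocab = char_vocabulary(de for _, de in pairs)

print(len(pairs), "sentence pairs")
print(len(english_vocab), "distinct English characters")
print(len(german_vocab), "distinct German characters")
```

To try this on a real download, replace io.StringIO(SAMPLE) with an open file handle, for example open(path, encoding="utf-8"), where path points to wherever the extracted file was saved.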
Figure 8.5 shows you a subset of this dataset to be used as the training set: