The first step is to retrieve the corpora. We've already seen how to do this, but let's now formalize it in a function. To keep the code reusable, let's put this function and the ones that follow in a file named corpora_tools.py.
- Let's do some imports that we will use later on:
import pickle
import re
from collections import Counter
from nltk.corpus import comtrans
- Now, let's create the function to retrieve the corpora:
def retrieve_corpora(translated_sentences_l1_l2='alignment-de-en.txt'):
    print("Retrieving corpora: {}".format(translated_sentences_l1_l2))
    # Load the aligned sentence pairs from the NLTK comtrans corpus
    als = comtrans.aligned_sents(translated_sentences_l1_l2)
    # .words holds the tokens of the first language, .mots those of the second
    sentences_l1 = [sent.words for sent in als]
    sentences_l2 = [sent.mots for sent in als]
    return sentences_l1, sentences_l2
This function has one argument; the file containing the aligned sentences from...
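To make the shape of the return value concrete, here is a minimal sketch that mimics what retrieve_corpora does, without downloading the comtrans corpus. The names FakeAlignedSent, split_aligned, and toy_corpus are hypothetical and used only for illustration; the sketch assumes nothing more than that each aligned sentence exposes a words attribute and a mots attribute, as in the function above.

```python
from collections import namedtuple

# Hypothetical stand-in for an aligned sentence pair: it only mimics the two
# attributes (.words and .mots) that retrieve_corpora reads.
FakeAlignedSent = namedtuple("FakeAlignedSent", ["words", "mots"])


def split_aligned(aligned_sents):
    """Split aligned sentence pairs into two parallel lists of token lists."""
    sentences_l1 = [sent.words for sent in aligned_sents]
    sentences_l2 = [sent.mots for sent in aligned_sents]
    return sentences_l1, sentences_l2


# Toy data made up for this example; it is not taken from the real corpus.
toy_corpus = [
    FakeAlignedSent(words=["Guten", "Morgen"], mots=["Good", "morning"]),
    FakeAlignedSent(words=["Danke"], mots=["Thank", "you"]),
]

sentences_l1, sentences_l2 = split_aligned(toy_corpus)

# The two lists are parallel: index i in one corresponds to index i in the other.
for l1, l2 in zip(sentences_l1, sentences_l2):
    print(l1, "->", l2)
```

The point of the sketch is that both returned lists have the same length and are aligned by position, so sentence i in the first list is the translation of sentence i in the second.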