Our data
First, we will discuss the data used for text generation and the preprocessing steps applied to clean it.
About the dataset
We begin by looking at what the dataset contains, so that we can later judge whether the generated text makes sense given the training data. We will download the first 100 books from the website https://www.cs.cmu.edu/~spok/grimmtmp/. These are English translations of a set of German books by the Brothers Grimm.
Initially, we will download all 209 books from the website with an automated script, as follows:
import os

url = 'https://www.cs.cmu.edu/~spok/grimmtmp/'
dir_name = 'data'

def download_data(url, filename, download_dir):
    """Download a file if not present, and make sure it's the right size."""
    # Create the download directory if it doesn't exist
    os.makedirs(download_dir, exist_ok=True)
    # If file doesn't exist download...