Our data
First, we will discuss the data used for text generation and the preprocessing steps employed to clean it.
About the dataset
We begin by looking at what the dataset contains, so that when we see the generated text, we can assess whether it makes sense given the training data. The data comes from the website https://www.cs.cmu.edu/~spok/grimmtmp/, which hosts English translations (from German) of stories by the Brothers Grimm. This is the same text used in Chapter 6, Recurrent Neural Networks, for demonstrating the performance of RNNs.
We will download the first 100 stories from the website with an automated script, as follows:
import os

url = 'https://www.cs.cmu.edu/~spok/grimmtmp/'

# Create a directory if needed
dir_name = 'stories'
if not os.path.exists(dir_name):
    os.mkdir(dir_name)

def maybe_download(filename):
    """Download a file if not present"""
    print('Downloading...
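Since the listing above is cut off, the following is a minimal sketch of how such a download step might look, not the original script. It makes several assumptions that the text does not confirm: that the files are named 001.txt, 002.txt, and so on (the hypothetical helper story_filenames encodes this), that urlretrieve from the standard library is an acceptable way to fetch them, and that the fetch parameter, added here only so the function can be exercised without network access, is useful.

```python
import os
from urllib.request import urlretrieve

URL = 'https://www.cs.cmu.edu/~spok/grimmtmp/'
DIR_NAME = 'stories'


def story_filenames(count=100):
    """Return file names under an ASSUMED scheme: 001.txt, 002.txt, ..."""
    return ['{:03d}.txt'.format(i) for i in range(1, count + 1)]


def maybe_download(filename, url=URL, dir_name=DIR_NAME, fetch=urlretrieve):
    """Download a file into dir_name unless a local copy already exists.

    `fetch` is a hypothetical hook (source URL, destination path) that
    defaults to urllib's urlretrieve; it can be swapped for a stub.
    """
    # Create the target directory if needed
    os.makedirs(dir_name, exist_ok=True)
    local_path = os.path.join(dir_name, filename)

    if not os.path.exists(local_path):
        print('Downloading', filename)
        fetch(url + filename, local_path)
    else:
        print(filename, 'already exists, skipping')

    return local_path
```

Under these assumptions, calling maybe_download(name) for each name in story_filenames() would fetch the first 100 files into the stories directory, and running it a second time would skip every file that is already present.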