Dataset
Before we explain the model part, let us start by processing the text corpus: we create the vocabulary and encode the text with it so that each word is represented as an integer. Any text corpus can be used as a dataset, such as Wikipedia, web articles, or posts from social networks such as Twitter. Frequently used datasets include the PTB, text8, BBC, IMDB, and WMT datasets.
In this chapter, we use the text8
corpus. It consists of a pre-processed version of the first 100 million characters from a Wikipedia dump. Let us first download the corpus:
wget http://mattmahoney.net/dc/text8.zip -O /sharedfiles/text8.gz
gzip -d /sharedfiles/text8.gz -f
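If wget is not available on your system, a minimal Python alternative can download and extract the archive (the file is actually a ZIP archive, which is why we open it with zipfile here; the /sharedfiles target directory is the same one assumed above):

import urllib.request
import zipfile

# Download the text8 archive and extract the single 'text8' member
urllib.request.urlretrieve('http://mattmahoney.net/dc/text8.zip',
                           '/sharedfiles/text8.zip')
with zipfile.ZipFile('/sharedfiles/text8.zip') as zf:
    zf.extract('text8', path='/sharedfiles')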
Now, we construct the vocabulary and replace rare words with an UNKNOWN token. Let us start by reading the data into a list of strings:
# Read the corpus into a list of lowercase word tokens
words = []
with open('/sharedfiles/text8') as fin:
    for line in fin:
        words += [w for w in line.strip().lower().split()]

data_size = len(words)
print('Data size:', data_size)
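The vocabulary step announced above keeps only the most frequent words and maps everything else to the UNKNOWN token. A minimal sketch of this step follows, assuming a vocabulary size of 50,000 (this value is an illustrative assumption, not one given in the text):

from collections import Counter

vocabulary_size = 50000  # assumed cutoff; rarer words map to UNK

# Keep the (vocabulary_size - 1) most frequent words;
# index 0 is reserved for the UNK token
counts = Counter(words)
vocabulary = ['UNK'] + [w for w, _ in counts.most_common(vocabulary_size - 1)]
word_to_index = {w: i for i, w in enumerate(vocabulary)}

# Encode the corpus as a list of integers, with 0 standing for UNK
data = [word_to_index.get(w, 0) for w in words]
print('Most common words:', counts.most_common(5))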