Data tokenization and vectorization
The Gigaword dataset has already been cleaned, normalized, and tokenized using the StanfordNLP tokenizer, and all of the text has been converted to lowercase, as seen in the preceding examples. The main task in this step is to create a vocabulary. A word-based tokenizer is the most common choice in summarization. However, we will use a subword tokenizer in this chapter. A subword tokenizer provides the benefit of limiting the size of the vocabulary while minimizing the number of unknown words. Chapter 3, Named Entity Recognition (NER) with BiLSTMs, CRFs, and Viterbi Decoding, specifically the part on BERT, described the different types of tokenizers. Notably, models such as BERT and GPT-2 use some variant of a subword tokenizer. The tfds
package provides a way for us to create a subword tokenizer, initialized from a corpus of text. Since generating the vocabulary requires running it over all of the training data...
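As a rough illustration of this step, the sketch below builds a subword vocabulary from the training text, saves it, and encodes a sample sentence. It assumes the TFDS "gigaword" build with document/summary supervised keys, and that SubwordTextEncoder is exposed under tfds.deprecated.text (older releases use tfds.features.text); the 100,000-example subset, the target vocabulary size, and the "gigaword_subwords" file prefix are illustrative choices rather than the book's exact settings.

```python
import tensorflow_datasets as tfds

# Load the training split of Gigaword; with as_supervised=True each
# element is a (document, summary) pair of byte strings.
train_ds = tfds.load("gigaword", split="train", as_supervised=True)

def corpus_gen():
    # Yield raw text for vocabulary construction. A subset of the
    # training data keeps the build time manageable (illustrative size).
    for document, summary in train_ds.take(100_000).as_numpy_iterator():
        yield document.decode("utf-8")
        yield summary.decode("utf-8")

# Build the subword vocabulary from the corpus generator.
tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    corpus_gen(), target_vocab_size=2**15
)

# Persist the vocabulary so it can be reloaded without rebuilding:
tokenizer.save_to_file("gigaword_subwords")
# tokenizer = tfds.deprecated.text.SubwordTextEncoder.load_from_file("gigaword_subwords")

ids = tokenizer.encode("a man was arrested on tuesday .")
print(ids)                    # list of subword token IDs
print(tokenizer.decode(ids))  # round-trips back to the original text
```

Saving the vocabulary to disk is the key design point here: the encoder only needs to be built once, and subsequent runs can reload it instead of re-scanning the corpus.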