Chapter 8, Matching Tokenizers and Datasets
- A tokenized dictionary contains every word that exists in a language. (True/False)
False.
- Pretrained tokenizers can encode any dataset. (True/False)
False.
- It is good practice to check a dataset before using it. (True/False)
True.
- It is good practice to eliminate obscene data from datasets. (True/False)
True.
- It is good practice to delete data containing discriminatory assertions. (True/False)
True.
- Raw datasets might sometimes produce relationships between noisy content and useful content. (True/False)
True.
- A standard pretrained tokenizer contains the English vocabulary of the past 700 years. (True/False)
False.
- Old English can create problems when encoding data with a tokenizer trained on modern English. (True/False)
True.
- Medical and other types of jargon can create problems when encoding data with a tokenizer trained on modern English...