Chapter 3, Pretraining a RoBERTa Model from Scratch
- RoBERTa uses a byte-level byte-pair encoding tokenizer. (True/False)
True.
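The byte-level BPE tokenizer works on raw bytes, so any input string can be encoded without an unknown token. A minimal sketch, assuming the transformers library and access to the roberta-base checkpoint:

```python
# Minimal sketch: inspect RoBERTa's byte-level BPE tokenization.
# Assumes the transformers library and the roberta-base checkpoint.
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# Byte-level BPE splits text into sub-word pieces; tokens that follow a
# space carry the "Ġ" byte marker, and no <unk> token is needed.
print(tokenizer.tokenize("Pretraining a RoBERTa model from scratch"))
```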
- A trained Hugging Face tokenizer produces merges.txt and vocab.json. (True/False)
True.
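These two files are what a byte-level BPE tokenizer writes when it is saved: vocab.json holds the token-to-ID mapping and merges.txt holds the learned merge rules. A minimal training-and-saving sketch, assuming the tokenizers library and a hypothetical corpus file kant.txt:

```python
# Minimal sketch: train a byte-level BPE tokenizer and save its files.
# Assumes the tokenizers library; "kant.txt" is a hypothetical corpus file.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["kant.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# save_model writes vocab.json and merges.txt into the target directory.
os.makedirs("KantaiBERT", exist_ok=True)
tokenizer.save_model("KantaiBERT")
```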
- RoBERTa does not use token type IDs. (True/False)
True.
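This can be checked directly: the RoBERTa tokenizer does not emit token_type_ids, since RoBERTa was pretrained without the segment embeddings that BERT's next-sentence prediction objective requires. A minimal sketch, assuming the transformers library and the roberta-base checkpoint:

```python
# Minimal sketch: RoBERTa encodings carry no token_type_ids.
# Assumes the transformers library and the roberta-base checkpoint.
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
encoding = tokenizer("Hello world")

print(tokenizer.model_input_names)  # expected: ['input_ids', 'attention_mask']
print(encoding.keys())              # no 'token_type_ids' key
```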
- DistilBERT has 6 layers and 12 heads. (True/False)
True.
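The default DistilBERT configuration confirms this without downloading any weights. A minimal sketch, assuming the transformers library:

```python
# Minimal sketch: inspect DistilBERT's default architecture hyperparameters.
# Assumes the transformers library; no pretrained weights are downloaded.
from transformers import DistilBertConfig

config = DistilBertConfig()
print(config.n_layers)  # 6 transformer layers
print(config.n_heads)   # 12 attention heads per layer
```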
- A transformer model with 80 million parameters is enormous. (True/False)
False. 80 million parameters is a small model by current standards; BERT-large alone has around 340 million.
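Counting parameters takes one line. A minimal sketch, assuming the transformers library and PyTorch, with a small RoBERTa-style configuration similar to the one built in this chapter:

```python
# Minimal sketch: count the parameters of a small RoBERTa-style model.
# Assumes the transformers library and PyTorch; the configuration values
# mirror the chapter's small setup and are illustrative.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config)

# Roughly 84 million parameters: small compared to today's largest models.
print(sum(p.numel() for p in model.parameters()))
```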
- We cannot train a tokenizer. (True/False)
False. A tokenizer can be trained.
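Besides training a byte-level BPE tokenizer from scratch with the tokenizers library, an existing fast tokenizer can be retrained on a new corpus. A minimal sketch, assuming the transformers library and a tiny illustrative in-memory corpus:

```python
# Minimal sketch: retrain an existing fast tokenizer on a new corpus.
# Assumes the transformers library; the corpus is a tiny illustrative list.
from transformers import AutoTokenizer

corpus = [
    "Pretraining a RoBERTa model from scratch.",
    "Byte-level BPE works on raw bytes.",
]

base_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
new_tokenizer = base_tokenizer.train_new_from_iterator(corpus, vocab_size=1_000)

print(new_tokenizer.tokenize("Pretraining a RoBERTa model"))
```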
- A BERT-like model has 6 decoder layers. (True/False)
False. A BERT-like model is built from encoder layers, not decoder layers; the small model configured in this chapter has 6 encoder layers.
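The model object itself shows this: a BERT model is built from a stack of encoder layers and has no decoder stack. A minimal sketch, assuming the transformers library (the model is randomly initialized, so nothing is downloaded):

```python
# Minimal sketch: a BERT model is a stack of encoder layers, with no decoder.
# Assumes the transformers library; the model is randomly initialized.
from transformers import BertConfig, BertModel

model = BertModel(BertConfig())      # default BERT-base geometry
print(len(model.encoder.layer))      # 12 encoder layers
print(hasattr(model, "decoder"))     # False: there is no decoder stack
```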
- Masked language modeling predicts a word hidden by a mask token in a sentence. (True/False)
True.
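The fill-mask pipeline illustrates the objective directly: the model ranks candidate tokens for the masked position. A minimal sketch, assuming the transformers library and the roberta-base checkpoint (RoBERTa's mask token is <mask>):

```python
# Minimal sketch: masked language modeling with a fill-mask pipeline.
# Assumes the transformers library and the roberta-base checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# The model predicts the token hidden behind <mask>.
for prediction in fill_mask("The capital of France is <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```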
- A BERT-like model has no self-attention sub-layers. (True/False)
False. Each encoder layer of a BERT-like model contains a self-attention sub-layer.
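Each encoder layer pairs a self-attention sub-layer with a feed-forward sub-layer, which can be confirmed by inspecting one layer. A minimal sketch, assuming the transformers library:

```python
# Minimal sketch: every BERT encoder layer contains a self-attention sub-layer.
# Assumes the transformers library; the model is randomly initialized.
from transformers import BertConfig, BertModel

first_layer = BertModel(BertConfig()).encoder.layer[0]
print(first_layer.attention)     # the self-attention sub-layer
print(first_layer.intermediate)  # the feed-forward sub-layer that follows it
```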
- Data collators are helpful for...
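For masked language modeling, a data collator batches the tokenized examples and applies dynamic masking on the fly. A minimal sketch, assuming DataCollatorForLanguageModeling from the transformers library with the roberta-base tokenizer as a stand-in:

```python
# Minimal sketch: a data collator that applies dynamic masking for MLM.
# Assumes the transformers library (with PyTorch); roberta-base is a stand-in.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

examples = [tokenizer("Pretraining a RoBERTa model from scratch.")]
batch = collator(examples)

# Roughly 15% of tokens are masked at random each time the batch is built;
# labels hold the original token IDs for masked positions and -100 elsewhere.
print(batch["input_ids"])
print(batch["labels"])
```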