Let's evaluate our newly acquired knowledge by answering the following questions:
- How does the sentence order prediction (SOP) task differ from the next sentence prediction (NSP) task?
- What are the parameter reduction techniques used in ALBERT?
- What is cross-layer parameter sharing?
- What are the shared feedforward and shared attention options in cross-layer parameter sharing?
- How does RoBERTa differ from the BERT model?
- What is the replaced token detection task in ELECTRA?
- How do we mask tokens in SpanBERT?