ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) is yet another interesting variant of BERT. We learned that we pre-train BERT using the MLM and NSP tasks. We know that in the MLM task, we randomly mask 15% of the tokens and train BERT to predict the masked tokens. Instead of using the MLM task as a pre-training objective, ELECTRA is pre-trained using a task called replaced token detection.
The replaced token detection task is very similar to MLM, but instead of masking a token with the [MASK] token, here we replace a token with a different token and train the model to classify, for each token in the input, whether it is an original token or a replaced one.
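To make this concrete, the following is a minimal sketch of how replaced token detection turns a sentence into a per-token binary classification problem. The example sentence, the hand-picked replacement, and the label convention (1 = replaced, 0 = original) are illustrative assumptions, not ELECTRA's actual training pipeline:

# A minimal sketch of the replaced token detection labeling scheme.
# The sentence, the swapped word, and the 0/1 label convention are
# illustrative assumptions.

original_tokens = ["the", "chef", "cooked", "the", "meal"]

# Suppose one token has been swapped for a different but plausible token
# ("cooked" -> "ate"); the model is trained to spot such swaps.
corrupted_tokens = ["the", "chef", "ate", "the", "meal"]

# For every position, the target label says whether the token is
# original (0) or replaced (1).
labels = [int(orig != corr) for orig, corr in zip(original_tokens, corrupted_tokens)]

print(list(zip(corrupted_tokens, labels)))
# [('the', 0), ('chef', 0), ('ate', 1), ('the', 0), ('meal', 0)]

Note that, unlike MLM, where the model makes predictions only at the roughly 15% of positions that were masked, this objective provides a training signal for every token in the input.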
Okay, but why use the replaced token detection task instead of the MLM task? One of the problems with the MLM task is that it uses the [MASK] token during pre-training, but the [MASK] token will not be present during fine-tuning on downstream tasks. This causes a mismatch between pre-training and fine-tuning: the model sees the artificial [MASK] token only while pre-training. The replaced token detection task avoids this mismatch because its inputs never contain the [MASK] token.