Finding your pretraining loss function
We introduced this topic in Chapter 1 as a pretraining objective, or in vision as a pretext task. Remember that these are essentially different words for the same thing: the mathematical quantity your model optimizes while performing self-supervised learning. This is valuable because it lets you draw on unsupervised data, which is generally far more abundant than supervised data. Usually, this pretraining function hides or corrupts part of the data and trains the model to recover the real patterns from the altered ones. Causal language modeling, as with GPT, hides everything after the current position and learns to predict the next token. Masked language modeling, as with BERT, masks some of the tokens and learns to predict which words were masked. Other functions substitute some words with plausible alternatives and train the model to detect the replacements, which makes better use of each example and so reduces the overall size of the needed dataset (token detection as with DeBERTa).
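To make the distinction concrete, here is a minimal sketch, in PyTorch, of how the causal and masked language modeling objectives turn a model's logits into a loss. The random tensors, vocabulary size, and masking rate are illustrative assumptions standing in for a real model's output and a real tokenized batch; this is not code from any particular library.

```python
# A minimal sketch (illustrative, not a library implementation) contrasting two
# pretraining objectives: causal language modeling (next-token prediction, as in
# GPT) and masked language modeling (predicting masked tokens, as in BERT).
import torch
import torch.nn.functional as F

vocab_size = 50_000          # assumed vocabulary size
batch, seq_len = 4, 128      # assumed batch shape

# Stand-ins for a transformer's output logits and a tokenized batch
logits = torch.randn(batch, seq_len, vocab_size)      # (batch, seq_len, vocab)
input_ids = torch.randint(0, vocab_size, (batch, seq_len))

# --- Causal language modeling (GPT-style) ---
# The label for position t is the token at position t + 1, so shift by one.
clm_logits = logits[:, :-1, :].reshape(-1, vocab_size)
clm_labels = input_ids[:, 1:].reshape(-1)
clm_loss = F.cross_entropy(clm_logits, clm_labels)

# --- Masked language modeling (BERT-style) ---
# Randomly pick ~15% of positions and score the model only on those.
mask = torch.rand(batch, seq_len) < 0.15
mlm_labels = input_ids.clone()
mlm_labels[~mask] = -100     # ignore_index: unmasked positions contribute no loss
mlm_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    mlm_labels.reshape(-1),
    ignore_index=-100,
)

print(f"causal LM loss: {clm_loss:.3f}, masked LM loss: {mlm_loss:.3f}")
```

Note that in a real masked language modeling pipeline, the selected positions would also be replaced with a [MASK] token (or a random token) in the model input before the forward pass; the sketch only shows the loss side of each objective.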
Important note
When we pretrain our models, we use a pretraining loss function to create the ability...