Working with GLMs
The Transformer architecture was originally designed for text-to-text tasks such as machine translation (MT) and summarization, but it has since been applied to diverse NLP problems, ranging from token classification to coreference resolution. Subsequent works began to use the left (encoder) and right (decoder) parts of the architecture separately and more creatively. Besides next-word prediction, denoising objectives are also used, in which the model learns to fully recover the original input from a corrupted or truncated version, as the sketch below illustrates.
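To make the two objectives concrete, here is a minimal sketch contrasting a next-word (causal) training pair with a T5-style denoising pair. The sentinel tokens (`<extra_id_0>`, ...) follow T5's convention; the exact corruption strategy (span lengths, corruption rate) varies by model, so treat this as an illustration rather than any model's exact recipe:

```python
sentence = "the quick brown fox jumps over the lazy dog".split()

# Next-word prediction (causal LM, e.g., GPT): at each step the model
# predicts the next token given only the tokens to its left.
causal_pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]
print(causal_pairs[2])  # (['the', 'quick', 'brown'], 'fox')

# Denoising (e.g., T5-style span corruption): spans are masked in the
# input, and the model must reproduce them in the target, marked by
# sentinel tokens.
corrupted = ["the", "quick", "<extra_id_0>", "jumps", "over", "<extra_id_1>", "dog"]
target = ["<extra_id_0>", "brown", "fox", "<extra_id_1>", "the", "lazy"]
```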
While text-to-text models such as T5 use both the encoder and decoder parts, decoder-only models such as GPT use only the decoder part. Decoder-only autoregressive (AR) models cannot access words to the right of the current word in a forward direction (or to the left of the current word in a backward direction); this property is called unidirectionality.
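In practice, unidirectionality is typically enforced with a causal (lower-triangular) attention mask, so that position i can attend only to positions j <= i. The following NumPy sketch shows such a mask and the usual way it is applied to attention scores:

```python
import numpy as np

# Causal mask: position i may attend only to positions j <= i, so future
# tokens are hidden both during training and during generation.
seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]

# In attention, disallowed positions are set to -inf before the softmax
# so they receive zero attention weight.
scores = np.random.randn(seq_len, seq_len)
masked_scores = np.where(causal_mask, scores, -np.inf)
```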
GPT and its successors (GPT-2, GPT-3, InstructGPT (a.k.a. GPT-3.5), and ChatGPT), Transformer-XL, and XLNet are some of the popular decoder-only AR models.
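As a quick usage sketch, one of these models can be tried out for next-word generation with the Hugging Face transformers library; the model name and generation parameters below are illustrative choices, not fixed requirements:

```python
from transformers import pipeline

# Load GPT-2 for autoregressive text generation.
generator = pipeline("text-generation", model="gpt2")

# The model extends the prompt one token at a time, left to right.
output = generator(
    "The Transformer architecture is",
    max_new_tokens=20,
    num_return_sequences=1,
)
print(output[0]["generated_text"])
```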