Working with Seq2Seq models
The encoder (on the left) and the decoder (on the right) of the transformer are connected through cross-attention, which lets each decoder layer attend over the output of the final encoder layer. This naturally pushes the model toward producing output that stays closely tied to the original input. A Seq2Seq model, which is the original transformer architecture, achieves this with the following scheme:
Input tokens -> embeddings -> encoder -> decoder -> output tokens
Seq2Seq models keep both the encoder and the decoder parts of the transformer. T5, Bidirectional and Auto-Regressive Transformers (BART), and Pre-training with Extracted Gap-sentences for Abstractive Summarization Sequence-to-sequence models (PEGASUS) are among the most popular Seq2Seq models.
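The following is a minimal sketch of this encoder-decoder flow in practice, assuming the Hugging Face transformers library and the publicly available t5-small checkpoint; the task prefix and generation settings are illustrative choices, not the only way to drive a Seq2Seq model:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# T5 is steered with task prefixes; here we ask for English-to-German translation.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")

# generate() runs the encoder once, then decodes token by token, with each
# decoder layer cross-attending over the final encoder states.
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

The same code works for other Seq2Seq checkpoints such as BART or PEGASUS by swapping the model name, since AutoModelForSeq2SeqLM loads any encoder-decoder model that supports conditional generation.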
T5
Most NLP architectures, ranging from Word2Vec to transformers, learn embeddings and other parameters by predicting masked words from their context (neighboring) words. In other words, we treat NLP problems as word prediction problems. Some studies cast almost all...