Forecasting with Transformers
For continuity, we will use the same household example that we forecasted with RNNs and RNNs with attention.
Notebook alert:
To follow along with the complete code, use the notebook named 03-Transformers.ipynb in the Chapter14 folder and the code in the src folder.
Although we learned about the vanilla Transformer as a model with an encoder-decoder architecture, it was really designed for language translation tasks. In language translation, the source sequence and target sequence are quite different, so the encoder-decoder architecture made sense. But soon after, researchers found that using only the decoder part of the Transformer also works well; this is called a decoder-only Transformer in the literature. The naming is a bit confusing because, if you think about it, the decoder differs from the encoder in two ways: masked self-attention and encoder-decoder attention. So, in a decoder-only Transformer...
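To make the structure concrete, here is a minimal sketch of a single decoder-only block (this is an illustration, not the code in the src folder; PyTorch is assumed). With no encoder to attend to, the encoder-decoder attention has nothing to do, and what remains is masked (causal) self-attention followed by the usual position-wise feed-forward layer:

```python
# Minimal sketch of a decoder-only Transformer block (illustrative, not the book's src code).
# Without an encoder, there is no encoder-decoder attention; the block is just
# masked (causal) self-attention plus a position-wise feed-forward network.
import torch
import torch.nn as nn


class DecoderOnlyBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: position t may only attend to positions <= t
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device), diagonal=1
        )
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward, exactly as in an encoder layer
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x


# Quick shape check: a batch of 8 windows, 48 time steps, 64-dimensional embeddings
block = DecoderOnlyBlock()
out = block(torch.randn(8, 48, 64))
print(out.shape)  # torch.Size([8, 48, 64])
```

Seen this way, a decoder-only block is structurally an encoder block whose self-attention is restricted by a causal mask, which is what makes it suitable for autoregressive forecasting.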