Forecasting with Transformers
For continuity, we will stick with the same household that we forecasted earlier with RNNs and RNNs with attention.
Notebook alert
To follow along with the complete code, use the notebook named 03-Transformers.ipynb in the Chapter14 folder and the code in the src folder.
Although we learned about the vanilla Transformer as a model with an encoder-decoder architecture, it was really designed for language translation tasks. In language translation, the source and target sequences are quite different, so the encoder-decoder architecture made sense. But soon after, researchers figured out that using the decoder part of the Transformer alone works well; in the literature, this is called a decoder-only Transformer. The naming is a bit confusing because, if you think about it, the decoder differs from the encoder in two ways: masked self-attention and encoder-decoder attention. So, in a decoder-only Transformer, how do we have encoder-decoder attention without an encoder?
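To make the masked self-attention part concrete, here is a minimal sketch in PyTorch (not the code from the src folder) of a causal self-attention block of the kind a decoder-only Transformer stacks; the CausalSelfAttention wrapper and the dummy tensor shapes are assumptions for illustration. The upper-triangular mask is what stops a time step from attending to future steps, which is exactly what autoregressive forecasting needs.

```python
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Minimal masked (causal) self-attention block, as used in a
    decoder-only Transformer: each time step can attend only to
    itself and earlier steps, so no encoder is needed."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        # Upper-triangular boolean mask: True positions are blocked,
        # so step t cannot see steps t+1, t+2, ...
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        return out


# Quick check with a dummy batch: 4 windows, 48 time steps, 64 features
x = torch.randn(4, 48, 64)
block = CausalSelfAttention(d_model=64, n_heads=4)
print(block(x).shape)  # torch.Size([4, 48, 64])
```

Because the mask makes every position blind to the future, the same block can be applied to the whole history window during training and then rolled forward one step at a time at prediction time.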