Further reading
Here are a few resources for further reading:
- The Illustrated Transformer by Jay Alammar: https://jalammar.github.io/illustrated-transformer/
- Transformer Networks: A mathematical explanation why scaling the dot products leads to more stable gradients (see the short sketch after this list): https://towardsdatascience.com/transformer-networks-a-mathematical-explanation-why-scaling-the-dot-products-leads-to-more-stable-414f87391500
- Why is Bahdanau’s attention sometimes called concat attention?: https://stats.stackexchange.com/a/524729
- GLU Variants Improve Transformer by Noam Shazeer (2020), arXiv:2002.05202: https://arxiv.org/abs/2002.05202
- What is Residual Connection? by Wanshun Wong: https://towardsdatascience.com/what-is-residual-connection-efb07cab0d55
- Attn: Illustrated Attention by Raimi Karim: https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3
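To accompany the second resource above, here is a minimal NumPy sketch (not taken from any of the linked articles) of the 1/sqrt(d_k) scaling in dot-product attention; the variable names q, k, v, and d_k follow the usual attention notation and are assumptions for illustration only.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q, k, v: arrays of shape (seq_len, d_k); returns attention output of shape (seq_len, d_k)."""
    d_k = q.shape[-1]
    # Dividing by sqrt(d_k) keeps the score variance near 1 for unit-variance inputs,
    # so the softmax does not saturate and its gradients stay usable.
    scores = q @ k.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 64)) for _ in range(3))
# Unscaled dot products have variance roughly d_k (= 64 here); scaling brings it back to roughly 1.
print(np.var(q @ k.T), np.var(q @ k.T / np.sqrt(64)))
print(scaled_dot_product_attention(q, k, v).shape)  # (4, 64)
```

The printed variances show the effect the linked article derives analytically: without scaling, the score variance grows with d_k, pushing the softmax toward near one-hot outputs and tiny gradients.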