Introducing transformers
Before we learn about ViTs, let us understand transformers from an NLP perspective. A transformer generates a representation (a word/vector embedding) that best describes a word given its context (the surrounding words). The major limitations of recurrent neural networks (RNNs) and long short-term memory (LSTM) architectures (covered in detail in the associated GitHub repository) are:
- The embedding of a word does not depend on the context in which the word appears (the word apple has the same embedding whether the context refers to the fruit or the company).
- Hidden states are calculated sequentially during training (a word's hidden state depends on the previous word's hidden state and can therefore only be computed after the previous hidden state is available), which makes processing text slow. Both limitations are sketched in code after this list.
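The following is a minimal PyTorch sketch of both limitations, using a toy vocabulary and untrained weights purely for illustration (the vocabulary and dimensions are assumptions, not taken from the book's code):

```python
import torch
import torch.nn as nn

# --- Limitation 1: context-independent embeddings ---
# A plain embedding table (as in word2vec/GloVe) maps each token id to a
# fixed vector, so "apple" gets the same embedding in both sentences.
vocab = {"i": 0, "ate": 1, "an": 2, "apple": 3, "bought": 4, "stock": 5}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

fruit_ctx = torch.tensor([vocab[w] for w in ["i", "ate", "an", "apple"]])
company_ctx = torch.tensor([vocab[w] for w in ["i", "bought", "apple", "stock"]])

apple_in_fruit = embedding(fruit_ctx)[3]      # "apple" at position 3
apple_in_company = embedding(company_ctx)[2]  # "apple" at position 2
print(torch.allclose(apple_in_fruit, apple_in_company))  # True: identical vector

# --- Limitation 2: sequential hidden-state computation ---
# Each hidden state depends on the previous one, so the time steps
# cannot be computed in parallel.
rnn_cell = nn.RNNCell(input_size=8, hidden_size=16)
h = torch.zeros(1, 16)
for x_t in embedding(fruit_ctx):       # one time step at a time
    h = rnn_cell(x_t.unsqueeze(0), h)  # h_t depends on h_{t-1}
```

The first half shows that a static embedding table returns the same vector for "apple" regardless of its neighbours; the second half shows the time-step loop that forces an RNN to process words one after another.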
Transformers address...