Classifying images with Vision Transformer
Vision Transformer (ViT, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, https://arxiv.org/abs/2010.11929) demonstrates the adaptability of the attention mechanism by introducing a clever technique for processing images. One way to use transformers for image inputs is to encode each pixel with four variables – pixel intensity, row, column, and channel location. Each pixel encoding is an input to a simple neural network (NN), which outputs a d-dimensional embedding vector. We can represent the three-dimensional image tensor as a one-dimensional sequence of these embedding vectors. This sequence acts as input to the model in the same way a token embedding sequence does. Each pixel will attend to every other pixel in the attention blocks.
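The following is a minimal sketch of this per-pixel encoding, assuming PyTorch; the class name PixelEmbedding and the two-layer NN are illustrative choices, not part of the ViT paper:

```python
import torch
import torch.nn as nn

class PixelEmbedding(nn.Module):
    """Embed each pixel of a C x H x W image as one sequence token."""

    def __init__(self, d_model: int):
        super().__init__()
        # A simple NN that maps the 4 per-pixel variables
        # (intensity, row, column, channel) to a d-dimensional vector
        self.proj = nn.Sequential(
            nn.Linear(4, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image shape: (C, H, W)
        c, h, w = image.shape
        chan, row, col = torch.meshgrid(
            torch.arange(c), torch.arange(h), torch.arange(w), indexing="ij"
        )
        # Stack (intensity, row, column, channel) per pixel -> (C*H*W, 4)
        feats = torch.stack(
            [
                image.flatten(),
                row.flatten().float(),
                col.flatten().float(),
                chan.flatten().float(),
            ],
            dim=-1,
        )
        # One-dimensional sequence of C*H*W embedding vectors: (C*H*W, d_model)
        return self.proj(feats)

# Even a small 3x32x32 image becomes a sequence of 3*32*32 = 3,072 tokens
tokens = PixelEmbedding(d_model=64)(torch.rand(3, 32, 32))
print(tokens.shape)  # torch.Size([3072, 64])
```

As the usage example shows, the resulting sequence length grows with the full pixel count, which leads directly to the drawback discussed next.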
This approach has some disadvantages related to the length of the input sequence (context window). Unlike a one-dimensional text sequence, an image has a two-dimensional structure (the color...