The Transformer model
The Transformer model was discussed in Chapter 4, Transfer Learning with BERT. It was inspired by the seq2seq model and has an Encoder part and a Decoder part. Since the Transformer does not rely on RNNs, input sequences need to be annotated with positional encodings, which allow the model to learn about the relationships between positions in a sequence. Removing recurrence lets the model process all positions in parallel, which vastly improves speed while reducing the memory footprint. This innovation has made very large models such as BERT and GPT-3 possible. The Encoder part of the Transformer was shown in the aforementioned chapter, and the full Transformer model was shown in Chapter 5, Generating Text with RNNs and GPT-2.

We will start with a modified version of the full Transformer. Specifically, we will modify the Encoder part to create a visual Encoder, which takes image data as input instead of text sequences. There are some other small modifications to...
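Before we get to those modifications, it may help to make the idea of positional encodings concrete. The following is a minimal sketch of the sinusoidal positional encoding described in the original Transformer paper, written in TensorFlow; the function name and the dimensions used in the example are illustrative, not part of the book's code.

```python
import numpy as np
import tensorflow as tf

def positional_encoding(max_len, d_model):
    # Each position gets a d_model-dimensional vector built from sines and
    # cosines of different frequencies, so positions can be distinguished
    # without recurrence.
    positions = np.arange(max_len)[:, np.newaxis]      # shape (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]           # shape (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / np.float32(d_model))
    angle_rads = positions * angle_rates                # shape (max_len, d_model)

    # Sine on even indices, cosine on odd indices
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    # Add a batch dimension so it can be added to a batch of embeddings
    return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)

# Example: encodings for 50 positions with 128-dimensional embeddings
pos_encoding = positional_encoding(50, 128)
print(pos_encoding.shape)  # (1, 50, 128)
```

These encodings are simply added to the input embeddings before the first Encoder layer, which is how the model learns about ordering even though all positions are processed in parallel.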