The encoders and decoders we have built so far used RNN-based architectures. Even when we discussed attention in the previous section, the attention mechanism was used in conjunction with RNN-based encoders and decoders. Transformers approach the problem differently: they build the encoders and decoders from the attention mechanism itself, doing away with the RNN backbone altogether. Transformers have proven to be more parallelizable and require far less training time, giving them clear advantages over the earlier architectures.
Let's try to understand the architecture of Transformers next.
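Before diving into the details, here is a minimal sketch of the idea in code, assuming PyTorch is available. It uses torch.nn.Transformer (not a library discussed in this section, so treat it as an illustrative assumption) to instantiate an encoder-decoder built entirely from attention layers, with no RNN anywhere; the tensor shapes are toy values chosen only to show the data flow:

```python
# A minimal sketch (assumes PyTorch): an encoder-decoder composed purely of
# attention layers -- no recurrent backbone involved.
import torch
import torch.nn as nn

d_model = 512                      # size of the hidden representations
model = nn.Transformer(
    d_model=d_model,
    nhead=8,                       # attention heads per layer
    num_encoder_layers=6,
    num_decoder_layers=6,
)

# Toy batch: (sequence length, batch size, d_model) -- illustrative shapes only
src = torch.rand(10, 2, d_model)   # input sequence fed to the encoder
tgt = torch.rand(7, 2, d_model)    # target sequence fed to the decoder
out = model(src, tgt)              # hidden states for every target position
print(out.shape)                   # torch.Size([7, 2, 512])
```

Because every position attends to the others in a single matrix operation rather than step by step through time, the whole sequence can be processed in parallel, which is where the training-time advantage over RNNs comes from.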
Understanding the architecture of Transformers
As in the previous section, Transformer modeling is based on converting a set of input sequences into a set of hidden states, which are then decoded into a set of output sequences. However, the way these encoders and decoders are built changes when using a Transformer. The...