Summary
New transformer models keep appearing on the market. It is therefore good practice to keep up with cutting-edge research by reading publications and books and by testing some of these systems.
This leads us to assess which transformer models to choose and how to implement them. We cannot spend months exploring every model that appears on the market, nor can we change models every month once a project is in production. Industry 4.0 is moving toward seamless API ecosystems.
Learning every model is impossible. However, we can understand a new model quickly by deepening our knowledge of the transformer architecture itself.
The basic structure of transformer models remains unchanged: the layers of the encoder and/or decoder stacks remain identical. The attention heads can be parallelized to optimize computation speed.
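To make the parallelization point concrete, here is a minimal NumPy sketch, not any particular library's implementation: the learned Q, K, and V projections are omitted for brevity, and all names are illustrative. It shows how all attention heads can be scored in one batched matrix product instead of a per-head loop:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads):
    """x: (seq_len, d_model); d_model must be divisible by n_heads."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # In a real model, Q, K, and V come from learned projections; here
    # we reuse x and simply reshape it into per-head slices.
    q = k = v = x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # One batched matmul scores all heads at once: (n_heads, seq, seq)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ v                    # (n_heads, seq, d_head)
    # Concatenate the heads back into a (seq_len, d_model) output
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

x = np.random.randn(10, 64)
print(multi_head_attention(x, n_heads=8).shape)  # (10, 64)
```

Frameworks such as PyTorch and TensorFlow exploit this same batching on GPUs, which is where the computation speed gain comes from.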
The Reformer model applies LSH buckets and chunking. It also recomputes each layer's input instead of storing it, thus reducing memory consumption. However...
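To illustrate the recomputation idea mentioned above, here is a minimal sketch of a reversible residual coupling in the spirit of the Reformer. It is not the actual Reformer code: F and G are illustrative stand-ins for the attention and feed-forward sublayers. Because the coupling is invertible, a layer's inputs can be recovered exactly from its outputs rather than stored for the backward pass:

```python
import numpy as np

def F(x):  # stands in for the attention sublayer (illustrative)
    return np.tanh(x)

def G(x):  # stands in for the feed-forward sublayer (illustrative)
    return np.tanh(2 * x)

def forward(x1, x2):
    # Reversible residual coupling over two input halves
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def recompute_inputs(y1, y2):
    # Invert the coupling: no stored activations are needed
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = forward(x1, x2)
r1, r2 = recompute_inputs(y1, y2)
print(np.allclose(x1, r1), np.allclose(x2, r2))  # True True
```

Trading this extra recomputation for not storing per-layer activations is what lets the Reformer process much longer sequences within the same memory budget.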