In Transformers, ‘attention’ is a mechanism in which every output element is connected to every input element, and the weightings between them are computed dynamically from the input rather than fixed in advance. This makes Transformers more flexible than models with fixed connectivity patterns, but it also means they can consume large amounts of memory when applied to data types with many elements, such as images or raw audio.
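To make the memory cost concrete, here is a minimal sketch of dense attention in plain Python. It is an illustration only, not the researchers' implementation: the function names `full_attention` and `attention_matrix_entries` are hypothetical, and the scaled dot-product scoring is the standard Transformer formulation assumed here for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def full_attention(queries, keys, values):
    """Dense attention: every output position weighs every input position.

    Each argument is a list of equal-length float vectors.
    Returns (outputs, weights), where weights is the n x n attention matrix.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # One score per input position (scaled dot product, assumed here).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


def attention_matrix_entries(n):
    """Number of weights in one dense attention matrix for n positions."""
    return n * n
```

Because the matrix has one weight per pair of positions, its size grows quadratically: a sequence of 3072 elements (the CIFAR-10 length mentioned later in this article) already needs `attention_matrix_entries(3072)` weights, about 9.4 million, for every layer and every attention head.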
One way of reducing this memory consumption is to recompute the attention matrix from checkpoints during backpropagation, a well-established technique in deep learning for reducing memory usage. However, recomputation trades memory for additional computation, and it does not help when the input is so large that even a single attention matrix is impractical to compute.
To overcome this, the OpenAI researchers introduced Sparse Attention.
For very large inputs, computing a single attention matrix can become impractical. The OpenAI researchers instead opted for sparse attention patterns, in which each output position computes weightings from only a subset of input positions. To design these patterns, the researchers first visualized the learned attention patterns of deep Transformers on images and found that many showed interpretable and structured sparsity. The team also observed that the attended input positions form small subsets and show a high degree of regularity.
The researchers also implemented a two-dimensional factorization of the attention matrix, in which the network can attend to all positions through two steps of sparse attention. This preserves the network’s ability to learn new patterns.
The first version, strided attention, is roughly equivalent to each position attending to its row and its column, and it resembles the learned attention patterns the researchers observed in their visualizations.
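The strided pattern and the "two steps reach everything" property can be sketched as boolean masks. This is a hedged illustration, not the researchers' code: the names `strided_masks` and `two_step_reach` are hypothetical, the `stride` parameter stands in for the row width, and the exact window boundaries are an assumption of this example.

```python
def strided_masks(n, stride):
    """Build two causal attention masks for a sequence of n positions.

    local[i][j]  is True when j is one of the `stride` most recent
                 positions up to and including i (the "row").
    column[i][j] is True when j <= i and j lies a whole number of
                 strides behind i (the "column").
    """
    local = [[j <= i and i - j < stride for j in range(n)] for i in range(n)]
    column = [[j <= i and (i - j) % stride == 0 for j in range(n)] for i in range(n)]
    return local, column


def two_step_reach(first, second):
    """reach[i][j] is True if i attends to some k under `first`
    and that k attends to j under `second`."""
    n = len(first)
    return [
        [any(first[i][k] and second[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def count_entries(mask):
    """Number of attended (True) pairs in a mask."""
    return sum(sum(1 for on in row if on) for row in mask)
```

With `local, column = strided_masks(16, 4)`, position 9 attends to positions 6 to 9 in the local mask and to positions 1, 5, and 9 in the column mask. Composing the two with `two_step_reach(column, local)` covers every earlier position, while `count_entries` shows that each mask holds far fewer weights than the dense causal matrix.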
The second version, fixed attention, attends to a fixed column and to the elements after the latest column element. According to the researchers, this pattern is useful when the data doesn’t fit into a two-dimensional structure.
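A minimal sketch of the fixed pattern follows, again as boolean masks. The function name `fixed_masks` and the parameters `stride` (block length) and `c` (how many trailing positions of each block act as the fixed column) are assumptions made for this illustration, not names taken from the article.

```python
def fixed_masks(n, stride, c=1):
    """Build two causal attention masks for the fixed pattern.

    block[i][j]   is True when j <= i and j is in the same block of
                  length `stride` as i (the elements after the latest
                  column element).
    summary[i][j] is True when j <= i and j is one of the last `c`
                  positions of any block (the fixed column).
    """
    block = [[j <= i and j // stride == i // stride for j in range(n)] for i in range(n)]
    summary = [[j <= i and j % stride >= stride - c for j in range(n)] for i in range(n)]
    return block, summary


def attended(mask, i):
    """Positions that position i attends to under the given mask."""
    return [j for j, on in enumerate(mask[i]) if on]
```

For example, with `block, summary = fixed_masks(16, 4)`, `attended(block, 9)` gives `[8, 9]` and `attended(summary, 9)` gives `[3, 7]`. Unlike the strided pattern, the column positions here do not depend on where the querying position sits, which is why the pattern does not require the data to have a regular two-dimensional layout.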
The researchers tested their architecture on density modeling tasks covering natural images, text, and raw audio, using datasets that include CIFAR-10, Enwik8, and ImageNet 64. The team trained strided Sparse Transformers on CIFAR-10 images represented as sequences of 3072 bytes. They also trained models on the Enwik8 dataset, which consists of the first 10^8 bytes of Wikipedia and contains variability in its periodic structure. They further trained on ImageNet 64, a downsampled version of ImageNet.
The researchers found that sparse attention achieved lower loss than full attention while also being faster.
According to the researchers, these sparse attention patterns are only preliminary steps toward efficient modeling of long sequences. They consider exploring different patterns and combinations of sparsity to be useful, and they see learning sparse patterns as a promising avenue of research for the next generation of neural network architectures.
According to them, autoregressive sequence generation still seems impractical for very high-resolution images or video. The optimized attention operations may nevertheless prove useful in combination with other approaches to modeling high-dimensional data, such as multi-scale approaches.
This is just an overview of the Sparse Transformer architecture. For more detailed information, we recommend reading the research paper.