Working with efficient self-attention
Efficient approaches restrict the attention mechanism to obtain an effective transformer model, because the computational and memory complexity of a transformer is mostly due to the self-attention mechanism. Self-attention scales quadratically with the input sequence length. For short inputs, quadratic complexity may not be an issue. However, to process longer documents, we need attention mechanisms that scale linearly with the sequence length.
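To make the quadratic cost concrete, here is a minimal NumPy sketch of standard scaled dot-product self-attention for a single head. The sequence length and dimensions are arbitrary illustrative values, not anything prescribed by a particular model; the point is that the score matrix alone has seq_len x seq_len entries, which is what becomes prohibitive for long documents.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy sizes chosen purely for illustration
seq_len, d_model = 1024, 64

rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))

# Full self-attention: the score matrix is seq_len x seq_len,
# so time and memory grow quadratically with the sequence length.
scores = Q @ K.T / np.sqrt(d_model)   # shape: (1024, 1024)
output = softmax(scores) @ V          # shape: (1024, 64)

print(scores.shape)                   # (1024, 1024) -> O(n^2) entries
```

Doubling the sequence length quadruples the number of score entries, which is why the efficient variants below all try to avoid materializing or computing this full matrix.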
We can roughly group the efficient attention solutions into three types:
- Sparse attention with fixed patterns
- Learnable sparse patterns
- Low-rank factorization/kernel function
Let's begin with sparse attention based on fixed patterns.
Sparse attention with fixed patterns
Recall that the attention mechanism is made up of a query, key, and values, as roughly formulated here:

Attention(Q, K, V) = Score(QK^T)V

Here, the Score function, which is mostly softmax, performs...
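As an illustration of what a fixed pattern looks like in code, the following is a minimal NumPy sketch of one common choice, a local sliding-window pattern in which each token only attends to its neighbors. The window size and array names are illustrative assumptions; for clarity the sketch still builds the dense score matrix and masks it, whereas an efficient implementation would compute only the in-band entries.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, window = 1024, 64, 32   # illustrative sizes

rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))

# Fixed banded pattern: token i may only attend to tokens j with |i - j| <= window.
idx = np.arange(seq_len)
mask = np.abs(idx[:, None] - idx[None, :]) <= window

scores = Q @ K.T / np.sqrt(d_model)
scores = np.where(mask, scores, -np.inf)  # drop positions outside the fixed pattern
output = softmax(scores) @ V

# Each row keeps at most 2 * window + 1 entries, so the useful work
# grows as O(seq_len * window), i.e., linearly in the sequence length.
print(mask.sum(axis=1).max())             # at most 65 attended positions per token
```

Because the pattern is fixed in advance rather than learned, the sparsity structure is known at graph-construction time, which is what makes these methods straightforward to implement efficiently.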