The generalized attention model
Over the years, researchers have come up with different ways of calculating attention weights and using attention in DL models. Sneha Chaudhari et al. published a survey paper on attention models that proposes a generalized attention model, which tries to incorporate all of these variations in a single framework. Let’s structure our discussion around this generalized framework.
We can think of an attention model as learning an attention distribution (α) for a set of keys, K, using a set of queries, q. In the example we discussed in the last section, the query would be the hidden state from the last timestep during decoding, and the keys would be all the hidden states generated from the input sequence. In some cases, the generated attention distribution is applied to another set of inputs called values, V. In many cases, K and V are the same, but to maintain the general form of the framework, we consider them separately.
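To make the framework concrete, here is a minimal sketch in plain NumPy. The function name and the dot-product scoring choice are illustrative assumptions rather than part of the survey’s formalism: the sketch scores a query against every key, normalizes the scores into an attention distribution α with a softmax, and uses α to take a weighted sum of the values.

```python
import numpy as np

def generalized_attention(q, K, V):
    """Compute an attention distribution over keys K for query q,
    then apply it to values V.

    q: query vector, shape (d,)
    K: keys, shape (n, d)   -- one key per input position
    V: values, shape (n, m) -- often identical to K
    Returns (context, alpha): the weighted sum of values and the
    attention distribution alpha over the n positions.
    """
    # Dot-product scoring -- one of several possible scoring functions
    scores = K @ q                      # shape (n,)
    # Softmax turns raw scores into a distribution alpha summing to 1
    scores = scores - scores.max()      # subtract max for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    # Weighted sum of values: positions with high alpha dominate
    context = alpha @ V                 # shape (m,)
    return context, alpha

# Example: 4 encoder hidden states of size 8; here K and V are the same
rng = np.random.default_rng(0)
K = rng.normal(size=(4, 8))
q = rng.normal(size=(8,))
context, alpha = generalized_attention(q, K, V=K)
print(alpha.sum())  # 1.0 -- alpha is a valid probability distribution
```

Passing V=K in the example mirrors the common case noted above, where keys and values coincide; keeping them as separate arguments is what preserves the general form of the framework.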