- The steps involved in the self-attention mechanism are given here:
- First, we compute the dot product between the query matrix $Q$ and the key matrix $K^T$, which gives us the similarity scores $QK^T$.
- Next, we divide $QK^T$ by the square root of the dimension of the key vector, $\sqrt{d_k}$.
- Then, we apply the softmax function to normalize the scores and obtain the score matrix $\text{softmax}\left(QK^T/\sqrt{d_k}\right)$.
- Finally, we compute the attention matrix $Z$ by multiplying the score matrix with the value matrix $V$, that is, $Z = \text{softmax}\left(QK^T/\sqrt{d_k}\right)V$.
- The self-attention mechanism is also called scaled dot-product attention, since here we are computing the dot product (between the query and key vectors) and scaling the values (with $\sqrt{d_k}$); a minimal code sketch of the full computation is given right after this item.
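
To make the four steps concrete, here is a minimal NumPy sketch of scaled dot-product attention. The random $Q$, $K$, and $V$ below are placeholders with illustrative shapes; how these matrices are actually created from the input is described in the next item.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return the attention matrix Z for query, key, and value matrices."""
    d_k = K.shape[-1]                          # dimension of the key vector
    scores = Q @ K.T / np.sqrt(d_k)            # steps 1-2: similarity scores, scaled
    # step 3: row-wise softmax gives the score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # step 4: attention matrix Z

# Toy example: 3 words, key/value dimension 4 (shapes chosen for illustration)
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```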
- To create the query, key, and value matrices, we introduce three new weight matrices called $W^Q$, $W^K$, and $W^V$. We create the query matrix $Q$, key matrix $K$, and value matrix $V$ by multiplying the input matrix $X$ by $W^Q$, $W^K$, and $W^V$, respectively; a short sketch of this step follows this item.
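
A complementary sketch of this step, assuming the dimensions and the names `W_q`, `W_k`, and `W_v` purely for illustration; in practice these weight matrices are learned during training, not randomly drawn:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 4                       # illustrative dimensions
X = rng.standard_normal((3, d_model))     # input matrix: one embedding per word

# W^Q, W^K, W^V are learned during training;
# they are randomly initialized here purely for illustration.
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q = X @ W_q   # query matrix
K = X @ W_k   # key matrix
V = X @ W_v   # value matrix
```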
- If we were to pass the preceding input matrix directly to the transformer, it would not understand the word order. So, instead of feeding...