- The steps involved in the self-attention mechanism are given here:
- First, we compute the dot product between the query matrix $Q$ and the (transposed) key matrix $K^T$ and get the similarity scores $QK^T$.
- Next, we divide $QK^T$ by $\sqrt{d_k}$, the square root of the dimension of the key vector.
- Then, we apply the softmax function to normalize the scores and obtain the score matrix $\operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$.
- Finally, we compute the attention matrix $Z$ by multiplying the score matrix with the value matrix $V$: $Z = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$.
- The self-attention mechanism is also called scaled dot product attention, since here we are computing the dot product (between the query and key vectors) and scaling the values (with $\sqrt{d_k}$).
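The following is a minimal NumPy sketch of these four steps; the function names `scaled_dot_product_attention` and `softmax` are illustrative, not from the text:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T                 # step 1: similarity scores QK^T
    scores = scores / np.sqrt(d_k)   # step 2: scale by sqrt(d_k)
    weights = softmax(scores)        # step 3: normalize into the score matrix
    return weights @ V               # step 4: attention matrix Z = weights · V
```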
- To create query, key, and value matrices, we introduce three new weight matrices called $W^Q$, $W^K$, and $W^V$. We create the query ($Q$), key ($K$), and value ($V$) matrices by multiplying the input matrix $X$ by $W^Q$, $W^K$, and $W^V$, respectively.
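Continuing the preceding sketch, we can project a toy input matrix $X$ into $Q$, $K$, and $V$ and run the attention computation end to end; the dimensions and random initialization below are purely illustrative:

```python
rng = np.random.default_rng(0)

# Toy input: 3 words, embedding size 4 (illustrative values only)
X = rng.normal(size=(3, 4))

# Weight matrices W^Q, W^K, W^V; randomly initialized here, learned in practice
W_Q, W_K, W_V = (rng.normal(size=(4, 2)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # query, key, and value matrices
Z = scaled_dot_product_attention(Q, K, V)
print(Z.shape)  # (3, 2): one attention vector per input word
```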
- If we were to pass the preceding input matrix $X$ directly to the transformer, it would not understand the word order. So, instead of feeding...