Interpreting attention heads
As with most deep learning (DL) architectures, neither the success of Transformer models nor exactly how they learn is fully understood, but we do know that Transformers learn many linguistic features of a language remarkably well. A significant amount of this learned linguistic knowledge is distributed across both the hidden states and the self-attention heads of the pretrained model. Substantial studies have recently been published, and many tools have been developed, to understand and better explain these phenomena.
Thanks to several natural language processing (NLP) community tools, we can interpret the information learned by the self-attention heads of a Transformer model. The heads lend themselves naturally to interpretation because each head produces a matrix of weights between tokens. In the experiments later in this section, we will see that certain heads correspond to particular aspects of syntax or semantics, and that we can also observe surface-level patterns and many other linguistic features.
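As a minimal sketch of what these tools work with, the raw attention weights can be pulled directly out of a pretrained model. The example below assumes the Hugging Face transformers library and PyTorch are installed; the bert-base-uncased checkpoint, the example sentence, and the layer/head indices are arbitrary choices for illustration:

import torch
from transformers import AutoModel, AutoTokenizer

# Load a pretrained model and ask it to return attention weights
model_name = "bert-base-uncased"  # any pretrained checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

sentence = "The cat sat on the mat because it was tired."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch_size, num_heads, seq_len, seq_len)
attentions = outputs.attentions
print(f"layers: {len(attentions)}, heads per layer: {attentions[0].shape[1]}")

# Inspect a single head: its softmax-normalized weights between tokens
layer, head = 8, 10  # arbitrary indices, chosen only for illustration
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
weights = attentions[layer][0, head]  # (seq_len, seq_len)

# Show which token each position attends to most strongly
for i, tok in enumerate(tokens):
    j = int(weights[i].argmax())
    print(f"{tok:>12} -> {tokens[j]} ({weights[i, j].item():.2f})")

Printing the strongest attention target per token in this way already reveals some of the surface-level patterns mentioned above, such as heads that attend mostly to the previous token or to special tokens; the visualization tools we use next build richer views on top of exactly these weight matrices.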