Summary
Transformer models are trained to resolve word-level polysemy disambiguation as well as low-level, mid-level, and high-level dependencies. This is achieved by training million- to trillion-parameter models. The task of interpreting these giant models seems daunting. However, several tools are emerging.
We first installed BertViz. We learned how to interpret the computations of the attention heads with an interactive interface. We saw how words interacted with other words for each layer.
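What BertViz renders for each head is the matrix of attention weights linking every token to every other token. The underlying computation can be sketched in plain NumPy (a minimal scaled dot-product sketch with made-up toy vectors, not BertViz's actual code):

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention weights for one head.
    Rows: query tokens; columns: key tokens."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)      # softmax over key positions

# Toy 3-token example: each row is one token's query/key vector.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
W = attention_weights(Q, K)
print(W.shape)         # (3, 3): one attention distribution per token
print(W.sum(axis=-1))  # each row sums to 1
```

Each row of `W` is one of the token-to-token link patterns BertViz draws as lines between words.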
The chapter continued by defining the scope of probing and non-probing tasks. Probing tasks such as NER provide insights into how a transformer model represents language, whereas non-probing methods analyze how the model makes predictions. For example, LIT plugged PCA projections and UMAP representations into the outputs of a BERT transformer model. We could then analyze clusters of outputs to see how they fit together.
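The projector reduces high-dimensional model outputs to two dimensions so that clusters become visible. The PCA step can be sketched with plain NumPy (synthetic vectors stand in for real BERT outputs; the dimensions and cluster offsets are illustrative assumptions):

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)                         # center the embeddings
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                            # coordinates in the top-2 PC plane

# Two synthetic "clusters" of 768-dim vectors standing in for BERT outputs.
rng = np.random.default_rng(1)
a = rng.normal(loc=0.0, size=(20, 768))
b = rng.normal(loc=3.0, size=(20, 768))
coords = pca_2d(np.vstack([a, b]))
print(coords.shape)  # (40, 2): each output now has a 2-D position to plot
```

Plotting `coords` would show the two groups of outputs as separate clusters, which is the kind of structure we inspected in LIT.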
Finally, we ran transformer visualization via dictionary learning.
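Dictionary-learning visualization represents a transformer's hidden states as sparse combinations of learned "factors" (dictionary atoms). The sparse-coding step can be sketched with a greedy matching-pursuit toy (a random dictionary and a synthetic vector, not a real transformer's states or the tool's actual algorithm):

```python
import numpy as np

def sparse_code(x, D, n_atoms=3):
    """Greedy matching pursuit: approximate x as a sparse
    combination of dictionary atoms (columns of D)."""
    residual = x.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        corr = D.T @ residual            # correlation of residual with each atom
        k = np.argmax(np.abs(corr))      # best-matching atom
        coeffs[k] += corr[k]
        residual -= corr[k] * D[:, k]    # remove that atom's contribution
    return coeffs, residual

# Random unit-norm dictionary; a "hidden state" built from two of its atoms.
rng = np.random.default_rng(2)
D = rng.normal(size=(64, 32))
D /= np.linalg.norm(D, axis=0)
x = 2.0 * D[:, 5] + 1.0 * D[:, 17]
coeffs, residual = sparse_code(x, D)
print(np.nonzero(coeffs)[0])  # a handful of active "factors" explaining x
```

The few atoms with large coefficients play the role of the interpretable factors that the visualization associates with linguistic patterns.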