In the previous section, we learned how to extract embeddings from the pre-trained BERT model, and we saw that those embeddings are obtained from the final encoder layer. Now the question is: should we consider the embedding obtained only from the final encoder layer (the final hidden state), or should we also consider the embeddings obtained from all the encoder layers (all the hidden states)? Let's explore this.
Let's represent the input embedding layer with h_0, the first encoder layer (first hidden layer) with h_1, the second encoder layer (second hidden layer) with h_2, and so on up to the final, twelfth encoder layer, h_12, as shown in the following figure:
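As a minimal sketch, assuming the Hugging Face transformers library used in the previous section and an arbitrary example sentence, we can ask the model to return the hidden states of every layer rather than only the final one by setting output_hidden_states=True:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# output_hidden_states=True makes the model return the output of every layer
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

inputs = tokenizer('I love Paris', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of 13 tensors:
# hidden_states[0]  -> output of the input embedding layer (h_0)
# hidden_states[1]  -> output of the first encoder layer   (h_1)
# ...
# hidden_states[12] -> output of the final encoder layer   (h_12)
print(len(outputs.hidden_states))        # 13
print(outputs.hidden_states[12].shape)   # [1, sequence_length, 768]
```

Each tensor in the tuple has the shape [batch_size, sequence_length, hidden_size], so we have one 768-dimensional representation of every token at every layer.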
Instead of taking the embeddings (representations) only from the final encoder layer, the researchers of BERT have experimented with taking embeddings from different encoder layers.
For instance, for the NER task, the researchers have used the pre-trained BERT model to extract features instead of fine-tuning it, and they experimented with building the token features from the embeddings of different encoder layers rather than from the final encoder layer alone.
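For illustration, here is a short sketch of one such layer-combination strategy reported in the BERT paper's feature-based experiments, concatenating the hidden states of the last four encoder layers to form a feature vector for each token (continuing from the outputs object above):

```python
# Take the last four encoder layers (h_9, h_10, h_11, h_12)
last_four = outputs.hidden_states[-4:]

# Concatenate them along the hidden dimension to get one feature vector per token
token_features = torch.cat(last_four, dim=-1)
print(token_features.shape)   # [1, sequence_length, 4 * 768]
```

Other strategies, such as summing the last four layers or using a single intermediate layer, follow the same pattern: pick entries from outputs.hidden_states and combine them as needed.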