Introduction to sentence embeddings
Pre-trained BERT models do not produce efficient and independent sentence embeddings on their own, as they always need to be fine-tuned in an end-to-end supervised setting. This is because a pre-trained BERT model is best thought of as an indivisible whole: semantic information is spread across all of its layers, not just the final one. Without fine-tuning, using its internal representations independently may be ineffective. This also makes it hard to handle unsupervised tasks such as clustering, topic modeling, information retrieval, or semantic search. In a clustering task, for instance, we would have to score many sentence pairs with the model, which causes massive computational overhead.
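To make the problem concrete, here is a minimal sketch of the naive approach: mean-pooling the token representations of a pre-trained (not fine-tuned) BERT to obtain a fixed-size sentence vector. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, both chosen purely for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; any pre-trained BERT variant would behave similarly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The cat sat on the mat.", "A dog slept on the rug."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden states, ignoring padding tokens via the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, 768): one fixed-size vector per sentence
```

Pooled vectors like these are easy to compute, but the model was never trained to make them directly comparable, so without fine-tuning they often perform poorly as independent sentence representations.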
Luckily, several modifications of the original BERT model, such as Sentence-BERT (SBERT), have been proposed to derive semantically meaningful and independent sentence embeddings. We will discuss these approaches in a moment. In the NLP literature, many neural sentence embedding methods have been proposed...