Cosine Similarity
One significant capability that NLP automates is finding words or documents that are semantically related. A search engine, for example, must find words similar to the search terms; to do this, it measures the similarity of two words or of two documents. We cannot compare words or documents directly by their spelling, but we can compare them mathematically in a latent vector space. In Chapter 4, Latent Semantic Analysis with scikit-learn, we learned how to represent documents as vectors. Because documents are represented as vectors, they can be compared mathematically. How do we measure the similarity of two vectors? The measure is called cosine similarity, and it is widely used in NLP.
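As a quick sketch of the idea, the cosine similarity of two vectors is their dot product divided by the product of their norms, which equals the cosine of the angle between them. The minimal NumPy implementation below uses invented toy vectors purely for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b:
    dot product divided by the product of their Euclidean norms."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy document vectors, e.g. term counts over a four-word vocabulary
doc1 = np.array([1.0, 2.0, 0.0, 1.0])
doc2 = np.array([2.0, 4.0, 0.0, 2.0])  # points in the same direction as doc1
doc3 = np.array([0.0, 0.0, 3.0, 0.0])  # shares no terms with doc1

print(cosine_similarity(doc1, doc2))  # parallel vectors -> 1.0
print(cosine_similarity(doc1, doc3))  # orthogonal vectors -> 0.0
```

Parallel vectors score 1.0 (maximally similar), orthogonal vectors score 0.0, so documents about the same topic cluster near 1.0 regardless of their lengths.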
Search engines, whether powered by pre-LLM techniques (such as Word2Vec or Doc2Vec) or LLMs (such as BERT word embeddings), use cosine similarity, among other metrics such as Euclidean distance or Manhattan distance.