Semantic search algorithms
We have discussed the concept of semantic search in depth. Our next step is to walk through the different approaches we can take to conduct a semantic search. These are the actual search algorithms that use the distance metrics we’ve already discussed (Euclidean distance, dot product, and cosine similarity) to search over dense embeddings. We start with k-nearest neighbors (k-NN).
k-NN
One way to find similar vectors is through brute force. With brute force, you compute the distances between the query and all the data vectors. Then you sort the distances from closest to furthest and return some number of results. You can cut off the results based on a distance threshold, or you can define a set number to return, such as 5. The set number is called k, so you would say k=5. This is known in classical machine learning as the k-NN algorithm. It is a straightforward algorithm, but its performance degrades as the dataset grows...
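To make the brute-force approach concrete, here is a minimal NumPy sketch of k-NN using cosine similarity as the scoring function. The function name knn_search and the randomly generated embeddings are illustrative assumptions, not code from this book; any of the distance metrics discussed earlier could be substituted.

```python
import numpy as np

def knn_search(query: np.ndarray, vectors: np.ndarray, k: int = 5):
    """Brute-force k-NN: score the query against every vector, return the top k.

    This is an illustrative sketch; in practice you would use your own
    embedding matrix and query embedding.
    """
    # Cosine similarity between the query and every row of `vectors`
    scores = (vectors @ query) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    # Sort from most to least similar and keep the k best indices
    top_k = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top_k]

# Hypothetical usage: 10,000 embeddings of dimension 768
embeddings = np.random.rand(10_000, 768)
query_embedding = np.random.rand(768)
results = knn_search(query_embedding, embeddings, k=5)
```

Note that every query touches all of the data vectors, which is why the cost of brute-force k-NN grows linearly with the size of the dataset.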