Embedding Models
Embedding models are powerful machine learning techniques that map high-dimensional data into a lower-dimensional space while preserving its essential features. Crucial in natural language processing (NLP), they transform sparse word representations into dense vectors, capturing semantic similarities between words. Embedding models also process images, audio, video, and structured data, enhancing applications in recommendation systems, anomaly detection, and clustering.
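To make "capturing semantic similarities" concrete: two dense vectors can be compared with cosine similarity, where a score close to 1 means the underlying texts are semantically close. The following is a minimal sketch using NumPy and made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions; the values and labels here are purely illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means very similar, near 0 unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings, for illustration only
vec_movie = np.array([0.21, -0.34, 0.88])   # "space adventure film"
vec_query = np.array([0.19, -0.30, 0.91])   # "sci-fi movie in space"
vec_other = np.array([-0.75, 0.10, 0.02])   # "tax filing deadline"

print(cosine_similarity(vec_movie, vec_query))  # high, ~0.998
print(cosine_similarity(vec_movie, vec_other))  # low, negative here
```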
Here is an example of an embedding model in action. Suppose the full plot of every movie in a database has been previously embedded using OpenAI's text-embedding-ada-002 embedding model. Your goal is to find all movies and animations related to Guardians of the Galaxy, but not through traditional phonetic or lexical matching (where you would type some of the words in the title). Instead, you will search by semantic means, say, with the phrase Awkward team of space defenders. You will then use the same embedding model to embed this phrase and query the embedded movie plots. Table 4.1 shows an excerpt of the resulting embedding:
| Dimension | Value |
| --- | --- |
| 1 | 0.00262913 |
| 2 | 0.031449784 |
| 3 | 0.0020321296 |
| ... | ... |
| 1535 | -0.01821267 |
| 1536 | 0.0014683881 |
Table 4.1: Excerpt of embedding
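As a preview of the implementation later in this chapter, here is a minimal sketch of how such an embedding can be produced with the langchain-openai library. It assumes the package is installed (pip install langchain-openai) and that an OPENAI_API_KEY environment variable is set:

```python
from langchain_openai import OpenAIEmbeddings

# Use the same model that embedded the movie plots
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Embed the semantic search phrase; the result is a list of
# 1,536 floats, like the excerpt shown in Table 4.1
query_vector = embeddings.embed_query("Awkward team of space defenders")

print(len(query_vector))   # 1536
print(query_vector[:3])    # first three dimensions
```

The same call on each stored plot (or embed_documents for batches of texts) produces vectors that can then be compared with a similarity metric such as cosine similarity to rank the closest matches.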
This chapter will help you understand embedding models in depth. You'll also implement an example using the Python language and the langchain-openai library.
This chapter will cover the following topics:
- Differentiation between embedding models and LLMs
- Types of embedding models
- How to choose an embedding model
- Vector representations