What is an embedding model?
Embedding models are machine learning and artificial intelligence tools that transform large, complex data into a more manageable form. This process, known as embedding, involves reducing the data’s dimensionality.
Imagine going from a detailed world map with highways, railroads, rivers, trails, and so on, to a simpler, summarized version with only country boundaries and capital cities. This not only makes computation faster and less resource-intensive, but also helps identify and understand relationships within the data. Because embedding models streamline the processing and analyzing of large datasets, they are particularly useful in areas of language (text) processing, image and sound recognition, and recommendation systems.
Consider a vast library where each book represents a point in a high-dimensional space. Embedding models can help reorganize the library to make it easier to navigate, such as by grouping books on related topics closer together and reducing the library’s overall size. Figure 4.1 illustrates this concept:
Figure 4.1: An embedding model example for a library use case
This conversion from a high-dimensional, original representation to a lower-dimensional one laid the groundwork for advancements in natural language processing (NLP), computer vision, and more.
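To make the idea concrete, here is a minimal sketch of dimensionality reduction that uses principal component analysis (PCA) from scikit-learn as a stand-in for a learned embedding model; the 50-dimensional input vectors are random placeholders rather than real data:

```python
# A minimal sketch of dimensionality reduction, the core idea behind embeddings.
# PCA stands in for a learned embedding model; the input vectors are random placeholders.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
high_dim = rng.normal(size=(100, 50))   # 100 items, each described by 50 dimensions

pca = PCA(n_components=2)               # reduce each item to 2 dimensions
low_dim = pca.fit_transform(high_dim)

print(high_dim.shape)  # (100, 50)
print(low_dim.shape)   # (100, 2) -- the "summarized map" of the original data
```

A trained embedding model plays the same role as the PCA step here, but it learns the mapping from data so that semantic relationships survive the reduction.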
How do embedding models differ from LLMs?
Embedding models are specialized algorithms that reduce high-dimensional data (such as text, images, or sound) into a low-dimensional space of dense vectors. LLMs, on the other hand, are large artificial neural networks pre-trained on enormous corpora of text.
While both are rooted in neural networks, they employ distinct methodologies. LLMs are designed to generate coherent, contextually relevant text, leveraging massive amounts of data to understand and predict language patterns. Their basic building blocks include transformer architectures, attention mechanisms, and large-scale pre-training followed by fine-tuning.
In contrast, embedding models focus on mapping words, phrases, or even entire sentences into dense vector spaces where semantic relationships are preserved. They often use techniques such as contrastive loss, which helps in distinguishing between similar and dissimilar pairs during training. Positive and negative sampling is another technique employed by embedding models. Positive samples are similar items (such as synonyms or related sentences), while negative samples are dissimilar items (such as unrelated words or sentences). Figure 4.2 visualizes an example of contrastive loss and positive and negative sampling in 2D space. This sampling aids the model in learning meaningful representations by minimizing the distance between positive pairs and maximizing the distance between negative pairs in the vector space.
Figure 4.2: 2D visualization of contrastive loss and positive and negative sampling
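The following is a minimal sketch of a margin-based contrastive loss; the two-dimensional vectors and the margin value are illustrative only and not taken from any particular model:

```python
# A sketch of a margin-based contrastive loss over one positive and one negative pair.
import numpy as np

def contrastive_loss(anchor, other, is_positive, margin=1.0):
    """Pull positive pairs together; push negative pairs at least `margin` apart."""
    distance = np.linalg.norm(anchor - other)
    if is_positive:
        return distance ** 2                    # smaller distance -> smaller loss
    return max(0.0, margin - distance) ** 2     # penalize negatives closer than the margin

anchor   = np.array([0.9, 0.1])   # e.g., a vector for "car"
positive = np.array([0.8, 0.2])   # e.g., "automobile" (similar item)
negative = np.array([0.1, 0.9])   # e.g., "banana" (dissimilar item)

print(contrastive_loss(anchor, positive, is_positive=True))   # small loss
print(contrastive_loss(anchor, negative, is_positive=False))  # nonzero only if too close
```

In practice, an embedding model applies such a loss over many sampled pairs per training batch and updates the network that produces the vectors, rather than the vectors directly.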
To summarize, while LLMs excel in language generation tasks, embedding models are optimized for capturing and leveraging semantic similarities. Both enhance NLP by enabling machines to grasp and produce human language more effectively. Now, let’s look at an example of each.
Word2vec (developed by Google) transforms words into vectors and discerns semantic relationships, such as “king” is to “man” as “queen” is to “woman.” It’s useful for sentiment analysis, translation, and content recommendations, enhancing natural language understanding for machines.
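A sketch of this analogy arithmetic, assuming the gensim library and its downloadable word2vec-google-news-300 vectors (a large download), might look like this:

```python
# Sketch: word analogies with pretrained Word2vec vectors via gensim's downloader.
# Assumes the "word2vec-google-news-300" dataset is available (a large download).
import gensim.downloader as api

word_vectors = api.load("word2vec-google-news-300")

# "king" - "man" + "woman" should land near "queen"
print(word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```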
GPT-4 (developed by OpenAI) is an LLM that is characterized by its ability to generate human-like text based on the input it receives. GPT-4 excels in a range of language-based tasks, including conversation, content generation, summarization, and translation. Its architecture allows it to comprehend the intricate details and nuances of language, enabling it to perform tasks that require a deep understanding of context, humor, irony, and cultural references.
When to use embedding models versus LLMs
Embedding models are used in scenarios where the goal is to capture and leverage the relationships within data. They are the ideal choice for the following tasks:
- Semantic similarity: Finding or recommending items (such as documents or products) that are similar to a given item (a minimal sketch follows this list).
- Clustering: Grouping entities based on their semantic properties.
- Information retrieval: Enhancing search functionalities by understanding the semantic content of queries.
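As a brief illustration of the semantic similarity and information retrieval tasks above, here is a sketch that assumes the sentence-transformers package and its all-MiniLM-L6-v2 model:

```python
# Sketch: semantic similarity and retrieval with an off-the-shelf embedding model.
# Assumes the sentence-transformers package and its "all-MiniLM-L6-v2" model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How do I reset my password?"
documents = [
    "Steps to recover your account credentials",
    "Quarterly sales report for the EMEA region",
    "Changing your login password on the settings page",
]

query_vec = model.encode(query)
doc_vecs = model.encode(documents)

scores = util.cos_sim(query_vec, doc_vecs)  # cosine similarities, shape (1, 3)
print(scores)  # the two password-related documents should score highest
```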
LLMs are the go-to for tasks that require text understanding, generation, or both, such as the following:
- Content creation: Generating text that is coherent, contextually relevant, and stylistically appropriate. For example, generating a synopsis from the full plot of a movie.
- Conversational AI: Building chatbots and virtual assistants that can understand and engage in human-like dialogue, such as answering questions about employment policies and employee benefits.
- Language translation: The extensive training on language-diverse datasets allows LLMs to handle idiomatic expressions, cultural nuances, and specialized terminology.
Embedding models and LLMs both play crucial roles in AI. Embedding models capture and manipulate semantic properties compactly, while LLMs excel in generating and interpreting text. Using both, and selecting the right embedding models based on your goals, can unlock AI’s full potential in your projects.
Types of embedding models
Word-level models such as Global Vectors for Word Representation (GloVe) capture meaning from word co-occurrence across a corpus, while models such as Bidirectional Encoder Representations from Transformers (BERT) capture broader textual meanings at the sentence and document level. Specialized models such as fastText adapt to linguistic challenges such as rare and out-of-vocabulary words. All of these reflect the evolving landscape of embedding models.
In this section, you will explore many types of embedding models: word, sentence, document, contextual, specialized, non-text, and multi-modal.
Word embeddings
Word embedding models capture semantic meaning based on context within extensive text corpora. One common approach involves a neural network that learns word associations either by predicting a word from its surrounding context or vice versa. Another method combines matrix factorization with context window techniques, generating embeddings by summarizing word co-occurrence frequencies in large matrices. A further enhancement treats each word as a collection of character n-grams (a sequence of n adjacent symbols in a particular order), which helps to better handle prefixes, suffixes, and rare words. Word2vec and GloVe are examples of the first two approaches; fastText, covered later in this chapter, takes the n-gram approach.
Word2vec was one of the first embedding models to learn representations of words as vectors based on their contextual similarities. Developed by a team at Google, it uses two architectures: Continuous Bag of Words (CBOW), which predicts a word given its context, and skip-gram, which predicts the context for a given word. Word2vec has been shown to capture syntactic and semantic relationships between words, evidenced by its ability to deduce meanings from arithmetic operations performed on word vectors.
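A sketch of training a toy Word2vec model with gensim, where the sg parameter switches between CBOW and skip-gram, might look like the following; the corpus is far too small to learn meaningful vectors and only illustrates the API shape:

```python
# Sketch: training a tiny Word2vec model with gensim; sg=0 selects CBOW, sg=1 skip-gram.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)   # skip-gram

print(cbow.wv["king"].shape)                   # (50,) -- one dense vector per word
print(skipgram.wv.similarity("king", "queen"))
```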
GloVe, developed at Stanford University, merges the benefits of two leading word representation approaches: global matrix factorization with co-occurrence statistics and context window methods. By constructing a co-occurrence matrix from the corpus and applying dimensionality reduction techniques, GloVe captures both global statistics and local context, which is invaluable for tasks that require a deep understanding of word relationships.
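GloVe itself is trained with a weighted least-squares objective over co-occurrence counts; the following sketch only builds the kind of co-occurrence matrix that serves as its input, using a toy corpus and a symmetric window of size 2:

```python
# Sketch: the word co-occurrence counts that GloVe-style models factorize into vectors.
import numpy as np

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
window = 2  # symmetric context window

vocab = sorted({word for sentence in corpus for word in sentence})
index = {word: i for i, word in enumerate(vocab)}
cooc = np.zeros((len(vocab), len(vocab)))

for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooc[index[word], index[sentence[j]]] += 1

print(vocab)
print(cooc)  # GloVe factorizes (a weighted log of) such a matrix into word vectors
```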
Sentence and document embeddings
Sentence and document embedding models capture the overall semantic meaning of text blocks by considering word context and arrangement. A common approach aggregates word vectors into a coherent vector for the whole text unit. These models are useful in document similarity, information retrieval, and text summarization, such as synopses versus full movie plots. Notable models include Doc2vec and BERT.
Building on Word2vec, Doc2vec, also known as Paragraph Vector, encapsulates whole sentences or documents as vectors. It introduces a document ID token that allows the model to learn document-level embeddings alongside word embeddings, which aids significantly in tasks such as document classification and similarity comparison.
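A minimal sketch with gensim’s Doc2Vec, where each document carries a tag playing the role of the document ID described above, might look like this; the corpus is a toy placeholder:

```python
# Sketch: document-level embeddings with gensim's Doc2Vec on a toy corpus.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["databases", "store", "structured", "data"], tags=["doc0"]),
    TaggedDocument(words=["vector", "search", "finds", "similar", "items"], tags=["doc1"]),
]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

print(model.dv["doc1"].shape)                      # (50,) -- the learned document vector
print(model.infer_vector(["semantic", "search"]))  # embed a previously unseen document
```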
Google’s BERT employs context-aware embeddings, reading the entire sequence of words concurrently, unlike its predecessors that processed text linearly. This approach enables BERT to understand a word’s context from all surrounding words, resulting in more dynamic and nuanced embeddings and setting new standards across various NLP tasks.
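One common (though not the only) way to obtain a fixed-size text embedding from BERT is to average its token vectors. The sketch below assumes the Hugging Face transformers package and the bert-base-uncased weights:

```python
# Sketch: a fixed-size text embedding from BERT by mean pooling its token vectors.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Embedding models map text to vectors.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Average the per-token vectors into one vector for the whole sentence.
sentence_vector = outputs.last_hidden_state.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])
```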
Contextual embeddings
Contextual embedding models are designed to produce word vectors that vary according to the context in which a word is used in a sentence. These models use deep learning architectures that examine the whole sentence, and at times the surrounding sentences, to produce dynamic embeddings that capture nuances of a word’s particular context and linguistic environment. A model architecture of this kind uses a bi-directional framework to process text both forward and in reverse, thereby capturing fine semantic and syntactic dependencies within the preceding and following contexts. Contextual embeddings are useful in sentiment analysis (such as interpreting the tone of the text in an IT support ticket) and question-answering tasks, where interpreting the exact meaning of words is necessary. ELMo and GPT are two examples.
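To see context dependence in action, the sketch below compares the vectors that a contextual model produces for the word “bank” in two different sentences. It uses BERT because pretrained weights are readily available through the Hugging Face transformers package; ELMo and GPT, discussed next, rest on the same principle:

```python
# Sketch: the same word receives different vectors in different contexts.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    """Return the contextual vector of `word` within `sentence`."""
    tokens = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**tokens).last_hidden_state[0]
    position = tokens["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[position]

river_bank = vector_for("the boat drifted toward the river bank", "bank")
money_bank = vector_for("she deposited the cash at the bank", "bank")

# Well below 1.0: the two "bank" vectors differ because their contexts differ.
print(torch.cosine_similarity(river_bank, money_bank, dim=0))
```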
Embeddings from Language Models (ELMo) introduced dynamic, context-dependent embeddings, producing variable embeddings based on a word’s linguistic context. This approach greatly enhances performance on downstream NLP tasks by providing a richer language understanding.
OpenAI’s GPT series leverages transformer technology to offer embeddings pre-trained on extensive text corpora and fine-tuned for specific tasks. GPT’s success underscores the efficacy of combining large-scale language models with transformer architectures in NLP.
Specialized embeddings
Specialized embedding models capture specific linguistic properties, such as places, people, tone, and mood, in vector space. Some are language- or dialect-specific, while others analyze sentiment and emotional dimensions. Applications include legal document analysis, support ticket triage, sentiment analysis in marketing, and multilingual content management.
fastText is an example of a specialized embedding model. Developed by Facebook’s AI Research lab, fastText enhances Word2vec by treating words as bags of character n-grams, which proves particularly helpful for handling out-of-vocabulary (OOV) words. OOV words are words not seen during training and thus lack pre-learned vector representations, posing challenges for traditional models. fastText enables embeddings for OOV words through the summation of their sub-word embeddings. This makes it especially suitable for handling rare words and morphologically complex languages, which are languages with rich and varied word structures that use extensive prefixes, suffixes, and inflections to convey different grammatical meanings, such as Finnish, Turkish, and Arabic.
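A sketch of this OOV behavior with gensim’s FastText implementation, trained on a toy corpus, might look like the following:

```python
# Sketch: fastText builds vectors for out-of-vocabulary words from character n-grams.
from gensim.models import FastText

sentences = [
    ["the", "teacher", "teaches", "the", "class"],
    ["the", "student", "studies", "the", "lesson"],
]

model = FastText(sentences, vector_size=50, window=2, min_count=1)

print("teaching" in model.wv.key_to_index)  # False: never seen during training
print(model.wv["teaching"].shape)           # (50,) -- built from sub-word n-grams anyway
```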
Other non-text embedding models
Embedding models go beyond converting only text to vector representations. Images, audio, video, and even JSON data itself can be represented in vector form:
- Images: Models such as Visual Geometry Group (VGG) and Residual Network (ResNet) set benchmarks for translating raw images into dense vectors. These models capture important visual features, such as edges, textures, and color gradients, which are vital to many computer vision tasks, including image classification and object recognition. VGG excels at recognizing visual patterns, while ResNet improves accuracy in complex image-processing tasks, such as image segmentation or photo tagging (a minimal sketch of this approach follows this list).
- Audio: OpenL3 and VGGish are models for audio. OpenL3, adapted from the L3-Net architecture, embeds audio into a space rich in temporal and spectral context and is used in audio event detection and environmental sound classification. VGGish is derived from the VGG image architecture and applies the same principle, converting sound waves into small, compact vectors. This simplifies tasks such as speech recognition and music genre classification.
- Video: 3D Convolutional Neural Networks (3D CNNs or 3D ConvNets) and Inflated 3D (I3D) extend image embeddings to capture the temporal dynamics that are paramount to action recognition and video content analysis. 3D ConvNets apply convolutional filters in three dimensions (height, width, and time), capturing spatial and temporal dependencies in volumetric data, which makes them particularly effective for spatiotemporal data, such as video analysis, medical imaging, and 3D object recognition. I3D uses a spatiotemporal architecture that combines the outputs of two 3D ConvNets: one processes RGB frames, while the other handles optical flow predictions between consecutive frames. I3D models are useful for sports analytics and surveillance systems.
- Graph data: Node2vec and DeepWalk capture connectivity patterns of nodes within a graph and are applied in social network analysis, fraud detection, and recommendation systems. Node2vec learns continuous vector representations for nodes by performing biased random walks on the graph, capturing diverse node relationships and community structures and improving the performance of tasks such as node classification and link prediction. DeepWalk treats random walks over the graph as sequences of nodes, much like sentences in NLP, capturing the structural relationships between nodes and encoding them into continuous vector representations that can be used for node classification and clustering.
- JSON data: There are even JSON data embedding models, such as Tree-LSTM, which is a variation of the traditional long short-term memory (LSTM) networks, adapted specifically to handle data with a hierarchical tree structure, such as JSON. Unlike standard LSTM units that process data sequentially, Tree-LSTM operates over tree-structured data by incorporating states from multiple child nodes into a parent node, effectively capturing the dependencies in nested structures. This makes it particularly suitable for tasks such as semantic parsing and sentiment analysis, where understanding the hierarchical relationships within data can significantly improve performance. json2vec is an implementation of this kind of embedding model.
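As a brief illustration of the image case above, the following sketch derives image embeddings by removing the classification head from a pretrained ResNet-18 in torchvision; the input image is a random placeholder:

```python
# Sketch: image embeddings by dropping the classification head of a pretrained ResNet.
import torch
from torchvision import models

resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()   # remove the classifier; keep the feature extractor
resnet.eval()

image_batch = torch.rand(1, 3, 224, 224)   # stand-in for one preprocessed RGB image
with torch.no_grad():
    embedding = resnet(image_batch)

print(embedding.shape)  # torch.Size([1, 512]) -- a dense vector describing the image
```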
After single-mode models, you can explore multi-modal models. These analyze multiple data types simultaneously and are crucial for applications such as autonomous driving, where merging data from sensors, cameras, and LiDAR builds a comprehensive view of the driving environment.
Multi-modal models
Multi-modal embedding models process and integrate information from many types of data sources into a unified embedding space. This approach is incredibly useful when different modalities complement or reinforce each other and together can lead to better AI applications. Multi-modal models excel at in-depth comprehension of multisensory content, as required by multimedia search engines, automated content moderation, and interactive AI systems that engage users through both visual and verbal interaction. Here are a few examples:
- CLIP: A well-known multi-modal model by OpenAI. It learns to correlate images with textual descriptions in such a way that it can recognize images it has never seen during training, based on natural language queries (a minimal sketch follows this list).
- LXMERT: A model that focuses on processing both visual and text inputs. It can improve the performance of tasks such as visual question answering, which includes object detection.
- ViLBERT: Vision-and-Language BERT (ViLBERT) extends the BERT architecture to process both visual and textual inputs simultaneously by using a two-stream model where one stream handles visual features extracted from images using a pre-trained convolutional neural network (CNN or ConvNet), and the other processes textual data with cross-attention layers facilitating interaction between the two modalities. ViLBERT is used for tasks such as visual question answering and visual commonsense reasoning, where understanding image-text relationships is essential.
- VisualBERT: Integrates visual and textual information by combining image features with contextualized word embeddings from a BERT-like architecture. It is commonly used for tasks such as image-text retrieval and image captioning, where aligning and understanding both visual and textual information are essential.
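As a brief illustration of the CLIP example above, the following sketch scores an image against two text prompts, assuming the Hugging Face transformers package and the openai/clip-vit-base-patch32 checkpoint; the image is a random placeholder:

```python
# Sketch: scoring an image against text prompts with CLIP.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.fromarray(np.uint8(np.random.rand(224, 224, 3) * 255))  # stand-in image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher logits mean a stronger image-text match; softmax turns them into probabilities.
print(outputs.logits_per_image.softmax(dim=-1))
```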
You have now explored word, image, and multi-modal embeddings. Next, you’ll learn how to choose embedding models based on your application’s needs.