Data modeling
This section delves into the diverse types of data required by AI/ML systems, including structured, unstructured, and semi-structured data, and how these apply to MDN’s news articles. The following short descriptions of each set a baseline understanding:
- Structured data conforms to a predefined schema and is traditionally stored in relational databases for transactional information. It powers systems of engagement and intelligence.
- Unstructured data includes binary assets, such as PDFs, images, videos, and others. Object stores such as Amazon S3 allow storing these under a flexible directory structure at a lower cost.
- Semi-structured data, such as JSON documents, allows each document to define its own schema, accommodating both common and unique data points, or even the absence of some data.
MDN will store news articles, subscriber profiles, billing information, and more. For simplicity, in this chapter, you will focus on the data about each news article and its related binary content (in this case, images). Figure 6.1 describes the data model of the `articles` collection.
Figure 6.1: Schema for the articles collection
The articles collection represents a news article with metadata, including creation details, tags, and contributors. All documents feature a title, summary, body content in HTML and plain text, and associated media elements such as images.
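To make the schema concrete, here is a minimal sketch of a single `articles` document expressed as a Python dictionary. The exact field names (`body_html`, `contents`, `contributors`, and so on) are illustrative assumptions based on the description above and on the fields used later in this section, not the definitive schema from Figure 6.1:

```python
# Hypothetical example of one document in the articles collection.
# Field names are assumptions based on the description of Figure 6.1.
article = {
    "_id": "art-001",
    "title": "City council approves new transit plan",
    "summary": "The plan adds three bus rapid transit lines by 2027.",
    "body_html": "<p>The city council voted on Tuesday...</p>",
    "body_text": "The city council voted on Tuesday...",
    "tags": ["transit", "local-politics"],
    "contributors": [{"name": "A. Reporter", "role": "author"}],
    "created_at": "2025-01-15T09:30:00Z",
    "brand": "metro-daily",
    "subscription_type": "premium",
    "contents": [  # associated media elements; binaries live in object storage
        {"_id": "img-001", "type": "image", "url": "s3://mdn-assets/art-001/hero.jpg"}
    ],
}
```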
Enriching data with embeddings
To complete the MDN data model, you need to consider data that will also be represented and stored via embeddings. Text embeddings for article titles and summaries will enable semantic search, while image embeddings will help find similar artwork used across articles. Table 6.1 describes the data fields, embedding models to use, and their vector sizes.
| Type | Field(s) | Embedding model | Vector size |
| --- | --- | --- | --- |
| Text | `title` | OpenAI | 1,024 |
| Text | `summary` | OpenAI | 1,024 |
| Image | Images in the `contents` array | OpenAI CLIP | 768 |

Table 6.1: Embeddings for the articles collection
Each article has a title and summary. Instead of embedding them separately, you will concatenate them and create one text embedding for simplicity. Ideally, for images, you would store the embedding with each content object in the `contents` array. However, MongoDB Atlas vector indexes do not currently support fields inside arrays of objects, and storing large vectors alongside every content object leads to the bloated documents anti-pattern. The best practice is to store image embeddings in a separate collection and use the extended reference schema design pattern. You can learn more about indexing arrays with MongoDB, bloated documents, and the extended reference pattern from the links given in the Further Reading chapter of this book. Figure 6.2 shows the updated data model.
Figure 6.2: Schema for articles with embeddings
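As a rough sketch of the embedding step, the snippet below concatenates the title and summary into one text embedding and stores one embedding per image in a separate `article_content_embeddings` collection that references the parent article (the extended reference pattern). The model choices are assumptions, since Table 6.1 only names "OpenAI" and "OpenAI CLIP": `text-embedding-3-large` truncated to 1,024 dimensions, and the `clip-ViT-L-14` checkpoint from sentence-transformers, which produces 768-dimensional vectors. The connection string and local image paths are placeholders.

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from PIL import Image
from pymongo import MongoClient

db = MongoClient("mongodb+srv://...")["mdn"]   # placeholder connection string
openai_client = OpenAI()                        # reads OPENAI_API_KEY from the environment
clip = SentenceTransformer("clip-ViT-L-14")     # assumed CLIP checkpoint, 768-dim vectors


def embed_article(article: dict) -> None:
    # One text embedding for the concatenated title + summary (1,024 dimensions assumed).
    text = f"{article['title']}\n{article['summary']}"
    resp = openai_client.embeddings.create(
        model="text-embedding-3-large", input=text, dimensions=1024
    )
    db.articles.update_one(
        {"_id": article["_id"]},
        {"$set": {"semantic_embedding": resp.data[0].embedding}},
    )

    # One image embedding per content object, stored in a separate collection
    # with a reference back to the article (extended reference pattern).
    for content in article.get("contents", []):
        if content.get("type") != "image":
            continue
        # Assumes the image has already been fetched locally from object storage.
        vector = clip.encode(Image.open(content["local_path"]))
        db.article_content_embeddings.insert_one(
            {
                "article_id": article["_id"],
                "content_id": content["_id"],
                "content_embedding": vector.tolist(),
            }
        )
```

Later in this section, the shape of the `article_content_embeddings` document is refined (a composite `_id` plus duplicated filter fields) to support the search use cases.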
Table 6.2 shows the corresponding vector indexes.
| Collection | Vector index definition |
| --- | --- |
| `articles` | `{ "fields": [ { "numDimensions": 1024, "path": "semantic_embedding", "similarity": "cosine", "type": "vector" } ] }` |
| `article_content_embeddings` | `{ "fields": [ { "numDimensions": 768, "path": "content_embedding", "similarity": "cosine", "type": "vector" } ] }` |

Table 6.2: Vector search index definitions
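If you prefer to create these indexes from code rather than the Atlas UI, a sketch along the following lines should work with a recent pymongo driver (4.7 or later) against an Atlas cluster. The index names are illustrative assumptions, since the book does not specify them:

```python
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

db = MongoClient("mongodb+srv://...")["mdn"]  # placeholder connection string

# Vector index on the articles collection (first row of Table 6.2).
db.articles.create_search_index(
    SearchIndexModel(
        name="semantic_embedding_index",  # hypothetical name
        type="vectorSearch",
        definition={
            "fields": [
                {"numDimensions": 1024, "path": "semantic_embedding",
                 "similarity": "cosine", "type": "vector"}
            ]
        },
    )
)

# Vector index on the article_content_embeddings collection (second row of Table 6.2).
db.article_content_embeddings.create_search_index(
    SearchIndexModel(
        name="content_embedding_index",  # hypothetical name
        type="vectorSearch",
        definition={
            "fields": [
                {"numDimensions": 768, "path": "content_embedding",
                 "similarity": "cosine", "type": "vector"}
            ]
        },
    )
)
```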
Considering search use cases
Before finalizing the data model, let’s consider search use cases for articles, and adapt the model once more. Here are some broader search use cases:
- Find articles by matching lexically or semantically on title and summary, allowing filtering by brand and subscription type: This use case is called hybrid search and is covered in Chapter 5, Vector Databases. It combines semantic and lexical searches using reciprocal rank fusion. You can create a search index covering the `title` and `summary` fields for text search, and the `brand` and `subscription_type` fields for filtering.
- The same as the first use case, extended to include tags: You can use the same search index and add the `tags` field. You will also need a vector search index covering the `title + summary` embedding.
- Find other articles that use similar images, filtering by brand and subscription type: Vector search indexes on MongoDB Atlas support adding traditional fields for filtering. Since the image embeddings are stored in another collection, you will need to duplicate the article's `_id`, `brand`, and `subscription_type` fields from the `articles` collection into the `article_content_embeddings` collection. Since this collection already has an `_id` field, you can create a composite primary key that includes the `_id` of the article and the `_id` of the content.

Figure 6.3 shows the updated data model.
Figure 6.3: Updated schema for articles with embeddings
Table 6.3 shows updated vector indexes.
| Collection | Vector index definition |
| --- | --- |
| `articles` | `{ "fields": [ { "numDimensions": 1024, "path": "semantic_embedding", "similarity": "cosine", "type": "vector" }, { "path": "brand", "type": "filter" }, { "path": "subscription_type", "type": "filter" } ] }` |
| `article_content_embeddings` | `{ "fields": [ { "numDimensions": 768, "path": "content_embedding", "similarity": "cosine", "type": "vector" }, { "path": "brand", "type": "filter" }, { "path": "subscription_type", "type": "filter" }, { "path": "_id.article_id", "type": "filter" } ] }` |

Table 6.3: Updated vector search index definitions
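To illustrate how the third use case maps onto the second index in Table 6.3, here is a sketch of an aggregation that finds articles using images similar to a given one, restricted to a brand and subscription type and excluding the source article itself. The index name is a hypothetical carried over from the earlier sketch:

```python
def find_similar_image_articles(db, query_vector, brand, subscription_type,
                                source_article_id, limit=10):
    """Vector search over image embeddings with pre-filtering (sketch)."""
    pipeline = [
        {
            "$vectorSearch": {
                "index": "content_embedding_index",  # hypothetical index name
                "path": "content_embedding",
                "queryVector": query_vector,
                "numCandidates": limit * 20,  # oversample candidates for better recall
                "limit": limit,
                "filter": {
                    "$and": [
                        {"brand": {"$eq": brand}},
                        {"subscription_type": {"$eq": subscription_type}},
                        # Skip images belonging to the article we started from.
                        {"_id.article_id": {"$ne": source_article_id}},
                    ]
                },
            }
        },
        {
            "$project": {
                "article_id": "$_id.article_id",
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
    return list(db.article_content_embeddings.aggregate(pipeline))
```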
Table 6.4 shows the new text search index.
| Collection | Search index definition |
| --- | --- |
| `articles` | `{ "mappings": { "dynamic": false, "fields": { "brand": { "normalizer": "lowercase", "type": "token" }, "subscription_type": { "normalizer": "lowercase", "type": "token" }, "summary": { "type": "string" }, "tags": { "normalizer": "lowercase", "type": "token" }, "title": { "type": "string" } } } }` |

Table 6.4: Text search index definition
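As a sketch of how the lexical half of the hybrid search could use this index, the aggregation below matches on `title` and `summary` while constraining `brand` and `subscription_type`. The index name is an assumption, and using the `equals` operator against token fields assumes a reasonably recent Atlas Search release:

```python
def lexical_search(db, query, brand, subscription_type, limit=10):
    """Full-text search over title and summary with token filters (sketch)."""
    pipeline = [
        {
            "$search": {
                "index": "articles_text_index",  # hypothetical index name
                "compound": {
                    "must": [
                        {"text": {"query": query, "path": ["title", "summary"]}}
                    ],
                    "filter": [
                        {"equals": {"path": "brand", "value": brand}},
                        {"equals": {"path": "subscription_type", "value": subscription_type}},
                    ],
                },
            }
        },
        {"$limit": limit},
        {"$project": {"title": 1, "summary": 1, "score": {"$meta": "searchScore"}}},
    ]
    return list(db.articles.aggregate(pipeline))
```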
You learned about writing vector search queries in Chapter 4, Embedding Models. To learn more about hybrid search queries, you can refer to the tutorial at https://www.mongodb.com/docs/atlas/atlas-vector-search/tutorials/reciprocal-rank-fusion/.
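Until you adopt the server-side approach shown in that tutorial, a minimal client-side sketch of reciprocal rank fusion can combine the two ranked result lists. It assumes the hypothetical `lexical_search` helper above plus an analogous semantic search helper over the `semantic_embedding` index, and uses the conventional RRF constant of 60:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document _ids with reciprocal rank fusion (sketch).

    Each input list is ordered best-first; k=60 is the conventional RRF constant.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Example usage with hypothetical helpers for the same query:
# lexical_ids = [d["_id"] for d in lexical_search(db, "transit plan", "metro-daily", "premium")]
# semantic_ids = [d["_id"] for d in semantic_search(db, "transit plan", "metro-daily", "premium")]
# fused = reciprocal_rank_fusion([lexical_ids, semantic_ids])
```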
Now that you understand your data model and the required indexes, you need to consider the number of articles MDN will store (including the sizes of embeddings and indexes), peak daily traffic, and more to determine the overall storage and database cluster requirements.