Building AI Intensive Python Applications: Create intelligent apps with LLMs and vector databases

Product type Paperback
Published in Sep 2024
Publisher Packt
ISBN-13 9781836207252
Length 298 pages
Edition 1st Edition

Data modeling

This section covers the types of data that AI/ML systems require (structured, unstructured, and semi-structured) and how each applies to MDN’s news articles. Here is a short description of each:

  • Structured data conforms to a predefined schema and is traditionally stored in relational databases for transactional information. It powers systems of engagement and intelligence.
  • Unstructured data includes binary assets, such as PDFs, images, videos, and others. Object stores such as Amazon S3 allow storing these under a flexible directory structure at a lower cost.
  • Semi-structured data, such as JSON documents, allows each document to define its own schema, accommodating both common and unique data points, or even the absence of some data.

MDN will store news articles, subscriber profiles, billing information, and more. For simplicity, this chapter focuses on the data about each news article and its related binary content (in this case, images). Figure 6.1 describes the data model of the articles collection.

Figure 6.1: Schema for the articles collection

The articles collection represents a news article with metadata, including creation details, tags, and contributors. All documents feature a title, summary, body content in HTML and plain text, and associated media elements such as images.
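As a rough illustration of the description above, an article document might look like the following Python dictionary. Figure 6.1 is not reproduced here, so this is only a sketch: title, summary, tags, brand, subscription_type, and contents are named in this section, while the other field names (body_html, body_text, contributors, created_at, url) and all values are hypothetical.

```python
from datetime import datetime, timezone

# Fields the text says every article carries. The names body_html and
# body_text are assumptions standing in for "body content in HTML and plain text".
REQUIRED_FIELDS = ("title", "summary", "body_html", "body_text", "contents")

# Hypothetical sample document; all values are made up for illustration.
article = {
    "_id": "a1",
    "brand": "MDN Daily",
    "subscription_type": "premium",
    "title": "City council approves new transit plan",
    "summary": "The plan adds three bus lines by next year.",
    "body_html": "<p>The city council voted on Tuesday.</p>",
    "body_text": "The city council voted on Tuesday.",
    "tags": ["transit", "local"],
    "contributors": ["J. Doe"],
    "created_at": datetime(2024, 9, 1, tzinfo=timezone.utc),
    "contents": [
        {"_id": "c1", "type": "image", "url": "https://example.com/bus.jpg"},
    ],
}


def missing_fields(doc):
    """Return the required field names that are absent from doc."""
    return [name for name in REQUIRED_FIELDS if name not in doc]


print(missing_fields(article))  # []
```

Because the document is semi-structured, optional fields such as tags can simply be omitted from articles that do not need them.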

Enriching data with embeddings

To complete the MDN data model, you need to consider data that will also be represented and stored via embeddings. Text embeddings for article titles and summaries will enable semantic search, while image embeddings will help find similar artwork used across articles. Table 6.1 describes the data fields, embedding models to use, and their vector sizes.

Type    Field(s)   Embedding model                 Vector size
Text    title      OpenAI text-embedding-3-large   1,024
Text    summary    OpenAI text-embedding-3-large   1,024
Image   contents   OpenAI CLIP                     768

Table 6.1: Embeddings for the articles collection

Each article has a title and summary. Instead of embedding them separately, you will concatenate them and create one text embedding for simplicity. Ideally, for images, you would store the embedding with each content object in the contents array. However, vector indexes in MongoDB Atlas do not currently support fields inside arrays of objects, and storing every embedding inside the article would also lead to the anti-pattern of bloated documents. The best practice is to store image embeddings in a separate collection and use the extended reference schema design pattern. You can learn more about indexing arrays with MongoDB, bloated documents, and the extended reference pattern from the links given in the Further Reading chapter of this book. Figure 6.2 shows the updated data model.

Figure 6.2: Schema for articles with embeddings
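The concatenation step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the newline separator and the helper names are hypothetical, and openai_embed assumes the official openai Python package with an OPENAI_API_KEY set in the environment. The embed function is injected so that the rest of the code runs without network access.

```python
from typing import Callable, List


def build_semantic_text(title: str, summary: str) -> str:
    """Join title and summary into the single string that gets embedded.

    The newline separator is an assumption; the text only says "concatenate".
    """
    parts = [title.strip(), summary.strip()]
    return "\n".join(p for p in parts if p)


def openai_embed(text: str) -> List[float]:
    """Embed text with OpenAI (requires the openai package and an API key)."""
    from openai import OpenAI  # imported lazily so the helpers above work offline

    client = OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=1024,  # matches the vector size listed in Table 6.1
    )
    return response.data[0].embedding


def add_semantic_embedding(
    article: dict, embed: Callable[[str], List[float]]
) -> dict:
    """Return a copy of the article with a semantic_embedding field added."""
    text = build_semantic_text(article["title"], article["summary"])
    return {**article, "semantic_embedding": embed(text)}
```

In use, you might call add_semantic_embedding(article, openai_embed) before inserting the document, or pass a stub function in tests.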

Table 6.2 shows the corresponding vector indexes.

Collection: articles
Vector index: semantic_embedding_vix

{
  "fields": [
    {
      "numDimensions": 1024,
      "path": "semantic_embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}

Collection: article_content_embeddings
Vector index: content_embedding_vix

{
  "fields": [
    {
      "numDimensions": 768,
      "path": "content_embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}

Table 6.2: Vector search index definitions

Considering search use cases

Before finalizing the data model, let’s consider search use cases for articles, and adapt the model once more. Here are some broader search use cases:

  • Find articles by matching lexically or semantically on title and summary, allowing filtering by brand and subscription type: This use case is called hybrid search and is covered in Chapter 5, Vector Databases. It combines semantic and lexical searches using reciprocal rank fusion. You can create a search index covering the title and summary fields for text search, and the brand and subscription_type fields for filtering.
  • Same as the first use case, extended to include tags: For this use case, you can use the same search index and add the tags field. You will also need a vector search index covering the title + summary embedding.
  • Find other articles that use similar images, filtering by brand and subscription type: For this use case, vector search indexes on MongoDB Atlas support adding traditional fields for filtering. Since the image embeddings are stored in another collection, you will need to duplicate the article’s _id, brand, and subscription_type fields from the articles collection into the article_content_embeddings collection. Since there is already an _id field in this collection, you can create a composite primary key that includes the _id of the article and the _id of the content. Figure 6.3 shows the updated data model.

Figure 6.3: Updated schema for articles with embeddings
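The duplication of fields and the composite key described in the third use case might be sketched like this. Figure 6.3 is not reproduced here, so the exact layout is an assumption: _id.article_id matches the path used in the vector index definition, while content_id, the helper name, and the embed_image callable are hypothetical.

```python
from typing import Callable, Dict, List


def build_content_embedding_docs(
    article: dict, embed_image: Callable[[dict], List[float]]
) -> List[Dict]:
    """Build one article_content_embeddings document per content item.

    brand and subscription_type are copied from the article (extended
    reference pattern) so they can be used as vector search filters.
    """
    docs = []
    for content in article.get("contents", []):
        docs.append(
            {
                # Composite primary key: article _id plus content _id.
                "_id": {
                    "article_id": article["_id"],
                    "content_id": content["_id"],  # hypothetical key name
                },
                "brand": article["brand"],
                "subscription_type": article["subscription_type"],
                "content_embedding": embed_image(content),
            }
        )
    return docs
```

The trade-off of this pattern is that brand and subscription_type must be updated in both collections if they ever change on the article.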

Table 6.3 shows the updated vector indexes.

Collection: articles
Vector index: semantic_embedding_vix

{
  "fields": [
    {
      "numDimensions": 1024,
      "path": "semantic_embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "brand",
      "type": "filter"
    },
    {
      "path": "subscription_type",
      "type": "filter"
    }
  ]
}

Collection: article_content_embeddings
Vector index: content_embedding_vix

{
  "fields": [
    {
      "numDimensions": 768,
      "path": "content_embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "brand",
      "type": "filter"
    },
    {
      "path": "subscription_type",
      "type": "filter"
    },
    {
      "path": "_id.article_id",
      "type": "filter"
    }
  ]
}

Table 6.3: Updated vector search index definitions
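If you prefer to create these indexes from Python rather than the Atlas UI, the definitions above could be generated and submitted roughly as follows. This is a sketch, not the book's method: the helper names are hypothetical, and create_vector_index assumes PyMongo 4.7 or later (where SearchIndexModel accepts a type argument) and a collection on an Atlas cluster.

```python
from typing import Iterable


def vector_index_definition(
    path: str, num_dimensions: int, filter_paths: Iterable[str] = ()
) -> dict:
    """Build a vector search index definition shaped like those in Table 6.3."""
    fields = [
        {
            "numDimensions": num_dimensions,
            "path": path,
            "similarity": "cosine",
            "type": "vector",
        }
    ]
    # Each filter path becomes a separate field of type "filter".
    fields.extend({"path": p, "type": "filter"} for p in filter_paths)
    return {"fields": fields}


def create_vector_index(collection, name: str, definition: dict) -> str:
    """Submit the index to Atlas; returns the name of the new index."""
    from pymongo.operations import SearchIndexModel  # lazy import

    model = SearchIndexModel(
        definition=definition, name=name, type="vectorSearch"
    )
    return collection.create_search_index(model)


semantic_def = vector_index_definition(
    "semantic_embedding", 1024, ["brand", "subscription_type"]
)
content_def = vector_index_definition(
    "content_embedding", 768, ["brand", "subscription_type", "_id.article_id"]
)
```

For example, create_vector_index(db.articles, "semantic_embedding_vix", semantic_def) would request the first index, where db is a hypothetical PyMongo database handle.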

Table 6.4 shows the new text search index.

Collection: articles

Search index: lexical_six

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "brand": {
        "normalizer": "lowercase",
        "type": "token"
      },
      "subscription_type": {
        "normalizer": "lowercase",
        "type": "token"
      },
      "summary": {
        "type": "string"
      },
      "tags": {
        "normalizer": "lowercase",
        "type": "token"
      },
      "title": {
        "type": "string"
      }
    }
  }
}

Table 6.4: Text search index definition
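A lexical query against this index might be built as in the sketch below. The function name, the limit default, and the projected fields are assumptions. The pipeline uses the Atlas Search compound operator with a text clause on title and summary and equals clauses on the token fields; whether filter values need to be lowercased to match the lowercase normalizer is worth checking against the Atlas Search documentation.

```python
def lexical_search_pipeline(query, brand, subscription_type, limit=10):
    """Build an aggregation pipeline for the lexical_six search index."""
    return [
        {
            "$search": {
                "index": "lexical_six",
                "compound": {
                    # Full-text match on the two string fields.
                    "must": [
                        {"text": {"query": query, "path": ["title", "summary"]}}
                    ],
                    # Exact-match filters on the token fields.
                    "filter": [
                        {"equals": {"path": "brand", "value": brand}},
                        {
                            "equals": {
                                "path": "subscription_type",
                                "value": subscription_type,
                            }
                        },
                    ],
                },
            }
        },
        {"$limit": limit},
        {"$project": {"_id": 1, "title": 1, "score": {"$meta": "searchScore"}}},
    ]
```

You would pass the result to the aggregate method of the articles collection.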

You learned about writing vector search queries in Chapter 4, Embedding Models. To learn more about hybrid search queries, you can refer to the tutorial at https://www.mongodb.com/docs/atlas/atlas-vector-search/tutorials/reciprocal-rank-fusion/.
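To make the hybrid search idea concrete, here is a minimal sketch of a filtered vector search pipeline and a client-side reciprocal rank fusion step. The function names, the numCandidates multiplier, and the constant k=60 (a commonly used value) are assumptions. The linked tutorial performs the fusion inside the aggregation pipeline, so the Python version below only illustrates the scoring idea.

```python
def vector_search_pipeline(query_vector, brand, subscription_type, limit=10):
    """Build a $vectorSearch pipeline using the filters from Table 6.3."""
    return [
        {
            "$vectorSearch": {
                "index": "semantic_embedding_vix",
                "path": "semantic_embedding",
                "queryVector": query_vector,
                "numCandidates": limit * 10,  # arbitrary oversampling factor
                "limit": limit,
                "filter": {
                    "$and": [
                        {"brand": {"$eq": brand}},
                        {"subscription_type": {"$eq": subscription_type}},
                    ]
                },
            }
        },
        {
            "$project": {
                "_id": 1,
                "title": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids into (id, score) pairs, best first.

    Each document scores the sum of 1 / (k + rank) over every list in
    which it appears, with ranks starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


lexical_ids = ["a", "b", "c"]   # hypothetical ids from the lexical query
semantic_ids = ["b", "d", "a"]  # hypothetical ids from the vector query
fused = reciprocal_rank_fusion([lexical_ids, semantic_ids])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Documents that appear near the top of both lists, such as b and a here, rise above documents that appear in only one.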

Now that you understand your data model and the indexes required, you need to consider the number of articles MDN will store (including the sizes of embeddings and indexes), peak daily times, and more to determine the overall storage and database cluster requirements.
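As a back-of-the-envelope starting point for the embedding part of that estimate, you could compute the raw size of the vectors. Everything here is an assumption for illustration: the article and image counts are hypothetical, and the figure counts only 4 bytes per value (32-bit floats), ignoring document storage overhead and index structures, so treat it as a lower bound.

```python
def raw_vector_bytes(num_vectors, num_dimensions, bytes_per_value=4):
    """Raw payload size of the vectors, assuming 32-bit floats by default."""
    return num_vectors * num_dimensions * bytes_per_value


articles = 1_000_000        # hypothetical article count
images_per_article = 2      # hypothetical average

# One 1,024-dimension text vector per article (Table 6.1).
text_bytes = raw_vector_bytes(articles, 1024)
# One 768-dimension image vector per content item (Table 6.1).
image_bytes = raw_vector_bytes(articles * images_per_article, 768)

total_gib = (text_bytes + image_bytes) / 1024**3
print(f"Raw embedding payload: {total_gib:.2f} GiB")
```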
