Indexing
The next few steps make up the indexing stage, where we obtain our target data, pre-process it, and vectorize it. These steps are often run offline, ahead of time, so that the application is ready for later use. In some cases, though, it makes sense to run them in real time, for example in rapidly changing data environments where the dataset is relatively small. In this example, the steps are as follows:
- Web loading and crawling.
- Splitting the data into digestible chunks for vectorization and storage in Chroma DB.
- Embedding and indexing those chunks.
- Adding those chunks and embeddings to the Chroma DB vector store.
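Before walking through each step, it may help to see the overall shape of the splitting, embedding, and storing steps in miniature. The following is only a rough sketch that uses the Python standard library: `split_into_chunks`, `toy_embed`, and `build_index` are hypothetical names invented for this illustration, the hashed bag-of-words vector is a toy stand-in for a real embedding model, and the plain dictionary is a stand-in for the Chroma DB vector store. None of it reflects how Chroma DB works internally.

```python
import hashlib
import math


def split_into_chunks(text: str, chunk_size: int = 80, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip windows that contain only whitespace
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


def toy_embed(text: str, dims: int = 16) -> list[float]:
    """Toy embedding (assumption): hash each word into a bucket, then normalize.

    A real pipeline would call an embedding model here instead.
    """
    vector = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % dims] += 1.0
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm else vector


def build_index(text: str, chunk_size: int = 80, overlap: int = 20) -> dict[str, dict]:
    """Chunk the text, embed each chunk, and keep both in an in-memory store."""
    store = {}
    for number, chunk in enumerate(split_into_chunks(text, chunk_size, overlap)):
        store[f"chunk-{number}"] = {"text": chunk, "embedding": toy_embed(chunk)}
    return store


if __name__ == "__main__":
    index = build_index("Retrieval-augmented generation pairs a retriever with a generator. " * 3)
    for key, entry in index.items():
        print(key, repr(entry["text"][:30]), len(entry["embedding"]))
```

The overlap between neighboring chunks is a common design choice: it reduces the chance that a sentence cut at a chunk boundary loses its context entirely.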
Let’s start with the first step: web loading and crawling.
Web loading and crawling
To start, we need to pull in our data. This could be anything, of course, but we have to start somewhere!
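To make "pulling in data" from a web page concrete, here is a minimal sketch built only on the Python standard library. It is not the loader used in this chapter: `fetch_html`, `TextExtractor`, and `html_to_text` are hypothetical helpers written for illustration, and the approach simply keeps visible text while skipping the contents of `script` and `style` tags. A production loader would typically handle much more, such as navigation menus, encodings, and retries.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it (hypothetical helper; needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Join the fragments and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._parts).split())


def html_to_text(html: str) -> str:
    """Convert an HTML string to plain text using the extractor above."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello world.</p></body></html>"
    print(html_to_text(sample))
```

The parsing step works on any HTML string, so it can be tried on a local sample first and combined with `fetch_html` once a real URL is available.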
For our example, I am providing a web page based on some of the content from Chapter 1. I have adopted...