Code lab 11.2 – Text splitters
The file you need to access from the GitHub repository is titled CHAPTER11-2_TEXT_SPLITTERS.ipynb.
Text splitters split a document into chunks that can be used for retrieval. Large documents pose a problem for many parts of a RAG application, and the splitter is our first line of defense. Even if you could vectorize a very large document, the larger the document, the more contextual detail you will lose in the vector embedding. But that assumes you can vectorize a very large document at all, which you often can't! Most embedding models place relatively small limits on the size of the input we can pass to them, at least compared to the large documents many of us work with. For example, the context length for the OpenAI model we are using to generate our embeddings is 8,191 tokens. If we try to pass a larger document to the model, it will generate an error. These are the main reasons splitters exist, but these are not the only complexities...
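To make the idea concrete, here is a minimal sketch of the core mechanic a text splitter implements: breaking a long string into fixed-size chunks with a small overlap between neighbors so that no chunk exceeds the embedding model's limit. This is an illustrative, hypothetical helper written from scratch, not the splitter class used in the notebook; the function name and the `chunk_size`/`chunk_overlap` parameters are our own choices (though libraries such as LangChain use similar parameter names).

```python
# Illustrative fixed-size splitter with overlap (hypothetical helper,
# not the notebook's actual splitter). It chops text into chunks of at
# most chunk_size characters, with neighboring chunks sharing
# chunk_overlap characters so context at the boundaries isn't lost.
def split_text(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# A toy "large document": 1,000 characters.
doc = "word " * 200
chunks = split_text(doc, chunk_size=200, chunk_overlap=50)
print(len(chunks))                          # → 7
print(all(len(c) <= 200 for c in chunks))   # → True: every chunk fits the limit
```

Real splitters measure chunk size in tokens rather than characters (to respect limits like the 8,191-token one above) and try to break on natural boundaries such as paragraphs and sentences, but the sliding-window-with-overlap pattern is the same.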