Code lab 11.1 – Document loaders
The file you need to access from the GitHub repository is titled CHAPTER11-1_DOCUMENT_LOADERS.ipynb.
Document loaders play a key role in accessing, extracting, and pulling in the data that makes our RAG application function. They load and process documents from various sources, such as text files, PDFs, web pages, or databases, and convert them into a format suitable for indexing and retrieval.
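To make the idea concrete before we reach the real loaders, here is a minimal sketch in plain Python of what a loader does: read a source and return its content together with metadata about where it came from. The Document class and the load_text_file function are hypothetical names used only for illustration, loosely modeled on the content-plus-metadata shape that RAG frameworks tend to use; they are not the loaders used in this code lab.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical stand-in for the object a document loader returns."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path):
    """Read one plain-text file and wrap it in a single-item list of Documents."""
    file_path = Path(path)
    text = file_path.read_text(encoding="utf-8")
    # Record the source so retrieved chunks can be traced back to their file.
    return [Document(page_content=text, metadata={"source": str(file_path)})]


# Usage: write a small sample file, then load it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("RAG starts with loading data.", encoding="utf-8")
    docs = load_text_file(sample)

print(docs[0].page_content)
print(docs[0].metadata["source"])
```

Returning a list, even for a single file, is a design choice in this sketch: it lets loaders for multi-page or multi-record sources share the same return shape.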
Let’s install some new packages to support our document loading, which, as you might have guessed, involves some different file format-related packages:
%pip install bs4
%pip install python-docx
%pip install docx2txt
%pip install jq
The first one, bs4 (for Beautiful Soup 4), may look familiar, as we used it in Chapter 2 for parsing HTML. We also have a couple of Microsoft Word-related packages, such as python-docx, which helps with creating and updating Microsoft Word (.docx) files, and docx2txt, which...