The main aim of /api/feeder is to receive documents to be indexed, process them, and forward the processed data to Librarian to be added to the index. To do this, we need to process each document accurately. But what do we mean by "processing a document"?
It can be defined as the following set of consecutive tasks:
- We rely on the payload to provide us with a title and a link to the document. We download the linked document and use its contents in our index.
- The document can be thought of as one big blob of text, and we might have multiple documents with the same title. We need to identify each document uniquely and be able to retrieve it easily.
- The result of a search query expects the provided words to be present in the document. This means we need to extract all words from a document and also...
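
The tasks listed above can be sketched roughly as follows. This is a minimal illustration, not the actual feeder implementation: the names `ProcessedDocument`, `fetch_text`, `extract_words`, and `process_document` are hypothetical, the use of a random UUID as the unique identifier is an assumption, and the simple lowercase alphanumeric tokenizer is only one possible way to extract words. The shape of the data that Librarian expects is not shown here.

```python
import re
import uuid
from dataclasses import dataclass
from typing import Callable, List
from urllib.request import urlopen


@dataclass
class ProcessedDocument:
    doc_id: str       # assumed: a random identifier, independent of the title
    title: str        # taken from the payload
    link: str         # taken from the payload
    words: List[str]  # every word in the document, in order of appearance


def fetch_text(link: str) -> str:
    """Download the linked document and decode it as UTF-8 text."""
    with urlopen(link) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_words(text: str) -> List[str]:
    """Lowercase the text and return its alphanumeric words (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def process_document(
    title: str,
    link: str,
    fetch: Callable[[str], str] = fetch_text,
) -> ProcessedDocument:
    """Download the document, assign it a unique ID, and extract its words."""
    text = fetch(link)
    return ProcessedDocument(
        doc_id=uuid.uuid4().hex,  # unique even when two titles are identical
        title=title,
        link=link,
        words=extract_words(text),
    )
```

Passing the download step in as the `fetch` parameter keeps the sketch easy to try without network access, for example `process_document("Greeting", "http://example.com/a", fetch=lambda link: "Hello, World!")`.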