Pipeline 1: Collecting and preparing the documents
The code in this section retrieves the metadata we need from Wikipedia, fetches the documents, cleans them, and aggregates them so they are ready for insertion into the Deep Lake vector store. This process is illustrated in the following figure:
Figure 7.4: Pipeline 1 flow chart
Pipeline 1 includes two notebooks:
Wikipedia_API.ipynb, in which we will implement the Wikipedia API to retrieve the URLs of the pages related to the root page of the topic we selected, including the citations for each page. As mentioned, the topic is "marketing" in our case.
Knowledge_Graph_Deep_Lake_LlamaIndex_OpenAI_RAG.ipynb, in which we will implement all three pipelines. In Pipeline 1, it will fetch the URLs provided by the Wikipedia_API notebook, clean the corresponding pages, and load and aggregate them for upserting, as sketched below.
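Before turning to the notebooks themselves, here is a minimal sketch of what this collection step involves, assuming direct calls to Wikipedia's public MediaWiki API with the requests library. The function names (get_related_titles, fetch_clean_text) and the API_URL constant are illustrative choices, not the notebooks' actual code, which is developed in the following sections.

```python
# A rough sketch of Pipeline 1's collection step: find pages linked from a
# root topic page, fetch their plain-text content, and lightly clean it.
import re
import requests

API_URL = "https://en.wikipedia.org/w/api.php"  # public MediaWiki API endpoint

def get_related_titles(root_title: str, limit: int = 50) -> list[str]:
    """Return titles of pages linked from the root page.

    Note: results are paginated by the API; continuation handling
    (plcontinue) is omitted here for brevity.
    """
    params = {
        "action": "query",
        "titles": root_title,
        "prop": "links",
        "pllimit": limit,
        "format": "json",
    }
    data = requests.get(API_URL, params=params, timeout=30).json()
    titles = []
    for page in data["query"]["pages"].values():
        titles.extend(link["title"] for link in page.get("links", []))
    return titles

def fetch_clean_text(title: str) -> str:
    """Fetch a page's plain-text extract and normalize its whitespace."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "extracts",
        "explaintext": 1,  # strip wiki markup, return plain text
        "format": "json",
    }
    data = requests.get(API_URL, params=params, timeout=30).json()
    page = next(iter(data["query"]["pages"].values()))
    text = page.get("extract", "")
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

# Aggregate cleaned documents, ready to be upserted into the vector store.
titles = get_related_titles("Marketing")  # the chapter's example root topic
documents = [fetch_clean_text(t) for t in titles[:10]]
```

The actual notebooks go further, for example by collecting each page's citations as metadata, but the shape of the work is the same: enumerate related URLs from the root page, download the content, clean it, and aggregate it for upserting.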
We will begin by implementing the Wikipedia API.
Retrieving Wikipedia data and metadata
Let’...