Designing the LLM Twin’s data collection pipeline
Before digging into the implementation, we must understand the LLM Twin’s data collection ETL architecture, illustrated in Figure 3.1. We need to know which platforms we will crawl for data and how we will design our data structures and processes. The first step, however, is understanding how our data collection pipeline maps to an ETL process.
An ETL pipeline involves three fundamental steps:
- We extract data from various sources. In our case, we crawl platforms such as Medium, Substack, and GitHub to gather raw content.
- We transform this data by cleaning and standardizing it into a consistent format suitable for storage and analysis.
- We load the transformed data into a data warehouse or database.
For our project, we use MongoDB, a NoSQL document database, as our data warehouse. Although this is not a standard choice for a warehouse, we will explain the reasoning behind it shortly.
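To make the three steps concrete, here is a minimal ETL sketch. It assumes a hypothetical list of article URLs and a local MongoDB instance; names such as the llm_twin database, the articles collection, and the helper functions are illustrative placeholders, not the project’s actual implementation.

```python
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient


def extract(url: str) -> str:
    """Extract: download the raw HTML of a page."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def transform(raw_html: str, url: str) -> dict:
    """Transform: strip markup and standardize into a consistent document schema."""
    soup = BeautifulSoup(raw_html, "html.parser")
    return {
        "source_url": url,
        "title": soup.title.string.strip() if soup.title and soup.title.string else "",
        "content": soup.get_text(separator=" ", strip=True),
    }


def load(document: dict, collection) -> None:
    """Load: upsert the cleaned document into MongoDB, keyed by its source URL."""
    collection.update_one(
        {"source_url": document["source_url"]},
        {"$set": document},
        upsert=True,
    )


if __name__ == "__main__":
    client = MongoClient("mongodb://localhost:27017")  # assumed local MongoDB instance
    articles = client["llm_twin"]["articles"]          # illustrative database/collection names
    for url in ["https://example.com/some-article"]:   # placeholder URL list
        load(transform(extract(url), url), articles)
```

In practice, each platform needs its own extraction and cleaning logic, but the extract-transform-load flow stays the same, which is why we model the pipeline around these three steps.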
Figure 3.1: The LLM Twin’s data collection ETL architecture