Structuring your data pipeline project
At a high level, our data pipeline will run weekly, collecting data for the preceding 7 days and storing it in a way that machine learning jobs can consume downstream to generate models. We will structure our data folders into three types of data:
- Raw data: A dataset generated by retrieving the last 90 days of data from the Yahoo Finance API. We will store the data in CSV format, the same format in which it was received from the API. We will log the run in MLflow and record the number of rows collected; a minimal sketch of this step follows the list.
- Staged data: On the raw data, we will run quality checks and schema verification, and confirm that the data can be used in production. This data quality information will be logged in MLflow Tracking (see the second sketch after the list).
- Training data: The training data is the final product of the data pipeline; the third sketch after the list shows an illustrative feature step. It must be produced from data that has been deemed clean and suitable for running models. This dataset contains the data processed into features that...
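To make the raw data step concrete, here is a minimal sketch of what the weekly acquisition could look like. It assumes the yfinance package as the Yahoo Finance client, a hypothetical BTC-USD ticker, and a data/raw/ folder layout; the actual retrieval code for the project may differ:

```python
from datetime import date, timedelta

import mlflow
import yfinance as yf


def acquire_raw_data(ticker: str = "BTC-USD") -> None:
    """Retrieve the last 90 days of quotes and log the run in MLflow."""
    with mlflow.start_run(run_name="acquire_raw_data"):
        end = date.today()
        start = end - timedelta(days=90)
        # Pull daily quotes for the last 90 days from Yahoo Finance.
        df = yf.download(ticker, start=start, end=end)
        # Persist the raw dataset in CSV, the format we standardize on.
        df.to_csv("data/raw/data.csv")
        # Track how many rows this run collected.
        mlflow.log_metric("rows_collected", len(df))


if __name__ == "__main__":
    acquire_raw_data()
```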
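The staging step can then read the raw CSV back, run the checks, and log the outcomes to MLflow Tracking. The expected column set and the data/staged/ path below are assumptions for illustration:

```python
import mlflow
import pandas as pd

# Assumed schema: adjust to the columns the API actually returns.
EXPECTED_COLUMNS = {"Open", "High", "Low", "Close", "Volume"}


def stage_data() -> None:
    """Run quality checks on the raw data and log them to MLflow Tracking."""
    with mlflow.start_run(run_name="stage_data"):
        df = pd.read_csv("data/raw/data.csv", index_col=0)
        # Schema verification: every expected column must be present.
        schema_ok = EXPECTED_COLUMNS.issubset(df.columns)
        # Basic quality checks: no missing values and a non-empty dataset.
        missing_values = int(df.isna().sum().sum())
        quality_ok = schema_ok and missing_values == 0 and len(df) > 0
        mlflow.log_metric("missing_values", missing_values)
        mlflow.log_param("schema_ok", schema_ok)
        mlflow.log_param("quality_ok", quality_ok)
        if not quality_ok:
            raise ValueError("Raw data failed quality checks; not staging it.")
        # Only data that passed the checks is promoted to the staged folder.
        df.to_csv("data/staged/data.csv")
```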
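Finally, a sketch of the training data step, assuming the staged CSV from above. The daily-return and rolling-mean features, and the next-day target, are illustrative stand-ins for whatever feature set the models actually need:

```python
import mlflow
import pandas as pd


def generate_training_data() -> None:
    """Turn staged prices into model-ready features and persist them."""
    with mlflow.start_run(run_name="generate_training_data"):
        df = pd.read_csv("data/staged/data.csv", index_col=0)
        features = pd.DataFrame(index=df.index)
        # Daily return of the closing price as a simple feature.
        features["daily_return"] = df["Close"].pct_change()
        # 7-day rolling mean, matching the pipeline's weekly cadence.
        features["rolling_mean_7d"] = df["Close"].rolling(window=7).mean()
        # Example target: did the close rise on the following day?
        features["target"] = (df["Close"].shift(-1) > df["Close"]).astype(int)
        features = features.dropna()
        mlflow.log_metric("training_rows", len(features))
        features.to_csv("data/training/data.csv")
```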