Chapter 10 – Versioning and Reproducible Machine Learning Modeling
- MLflow: We introduced MLflow for experiment tracking and model monitoring in previous chapters, but you can also use it for data versioning (https://mlflow.org/).
- DVC: An open source version control system for managing data, code, and ML models. It is designed to handle large datasets and integrates with Git (https://dvc.org/).
- Pachyderm: A data versioning platform that provides reproducibility, provenance, and scalability in machine learning workflows (https://www.pachyderm.com/).
- No. Different versions of the same data file can be stored under the same name, and any earlier version can be restored when needed.
- Simply changing the random state, either when splitting data into training and test sets or during model initialization, can result in different parameter values and different performance on the training and evaluation sets.
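
The effect of the random state described in the last answer can be illustrated with a minimal sketch. It uses only the Python standard library rather than any particular ML framework; the synthetic dataset and the helper names (`make_data`, `split`, `fit_line`, `mse`, `run`) are hypothetical and chosen for illustration only. The data is generated once with a fixed seed, so the only thing that varies between runs is the seed used for the train/test split.

```python
import random


def make_data(n=200, seed=0):
    """Synthetic 1-D regression data: y = 3x + 2 + Gaussian noise.

    The seed is fixed so the dataset itself is identical on every call.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 3.0 * x + 2.0 + rng.gauss(0, 2.0)) for x in xs]


def split(data, test_fraction, random_state):
    """Shuffle a copy of the data with a seeded RNG, then split it."""
    rng = random.Random(random_state)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def fit_line(train):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, rows):
    """Mean squared error of a (slope, intercept) model on the given rows."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in rows) / len(rows)


def run(random_state):
    """Split with the given seed, fit, and report parameters and errors."""
    train, test = split(make_data(), 0.25, random_state)
    model = fit_line(train)
    return model, mse(model, train), mse(model, test)


if __name__ == "__main__":
    # Seed 0 is run twice to show that a fixed seed reproduces the result.
    for state in (0, 0, 1):
        (slope, intercept), train_mse, test_mse = run(state)
        print(
            f"random_state={state}: slope={slope:.4f} "
            f"intercept={intercept:.4f} "
            f"train_mse={train_mse:.4f} test_mse={test_mse:.4f}"
        )
```

Running the script should print identical numbers for the two runs with seed 0 and different numbers for seed 1, even though the data and the fitting procedure are unchanged. Recording the seed alongside the data and code versions is therefore part of making a result reproducible.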